
213 RR Team Dynamics, API Design, and System Resiliency with Daniel Jacobson of Netflix


02:25 – Daniel Jacobson Introduction

02:46 – How Netflix Looks at Programming and Development Team

05:03 – Maintaining a Consistent Culture

  • Setting Context

06:37 – Onboarding Process

08:15 – Engineering spirals: 10 philosophies to facilitate innovation

  • Introspection
  • Transformations:
    • Staffed Up
    • Solving the Resiliency Problem

15:04 – Making Space for Innovation

  • Building Expectations
  • Incrementing Deliverables
  • Building Trust and Confidence         
  • Maintenance Mode

23:12 – APIs

29:32 – Solving Real Problems, The Groovy Layer

31:34 – Hystrix and Patterns for Making Systems Resilient

39:14 – RxJava

41:17 – The Dynamic of Senior Engineers

  • Screening Process

44:02 – Conway's Law

47:44 – Best and Most Challenging Parts About Working for Netflix

  • Scaling and Maintaining 

Picks

Fund Club (Coraline)
The Codeless Code (Avdi)
Trotro (Avdi)
Serial Podcast (Chuck)
Happy Father’s Day! (Chuck)

RailsClips (Chuck)
StartUp (Daniel)
Reply All (Daniel)
Mystery Show (Daniel)
Chris Messina: Seeking Genius in Negative Space (Daniel)
Chris Messina: Full Stack Employee (Daniel)
Netflix Techblog (Daniel)
Netflix GitHub (Daniel)



TRANSCRIPT

AVDI:  I thought it was interesting that she had a concept of time zones.

CORALINE:  She could be a programmer.

[This episode is sponsored by Hired.com. Every week on Hired, they run an auction where over a thousand tech companies in San Francisco, New York, and L.A. bid on Ruby developers, providing them with salary and equity upfront. The average Ruby developer gets an average of 5 to 15 introductory offers and an average salary offer of $130,000 a year. Users can either accept an offer and go right into interviewing with the company or deny them without any continuing obligations. It’s totally free for users. And when you’re hired, they also give you a $2,000 signing bonus as a thank you for using them. But if you use the Ruby Rogues link, you’ll get a $4,000 bonus instead. Finally, if you’re not looking for a job and know someone who is, you can refer them to Hired and get a $1,337 bonus if they accept a job. Go sign up at Hired.com/RubyRogues.]

[This episode is sponsored by Codeship.com. Codeship is a hosted continuous delivery service focusing on speed, security and customizability. You can set up continuous integration in a matter of seconds and automatically deploy when your tests have passed. Codeship supports your GitHub and Bitbucket projects. You can get started with Codeship’s free plan today. Should you decide to go with a premium plan, you can save 20% off any plan for the next three months by using the code RubyRogues.]

[Snap is a hosted CI and continuous delivery that is simple and intuitive. Snap’s deployment pipelines deliver fast feedback and can push healthy builds to multiple environments automatically or on demand. Snap integrates deeply with GitHub and has great support for different languages, data stores, and testing frameworks. Snap deploys your application to cloud services like Heroku, Digital Ocean, AWS, and many more. Try Snap for free. Sign up at SnapCI.com/RubyRogues.]

[This episode is sponsored by DigitalOcean. DigitalOcean is the provider I use to host all of my creations. All the shows are hosted there along with any other projects I come up with. Their user interface is simple and easy to use. Their support is excellent and their VPS’s are backed on Solid State Drives and are fast and responsive. Check them out at DigitalOcean.com. If you use the code RubyRogues you’ll get a $10 credit.]

CHUCK:  Hey everybody and welcome to episode 213 of the Ruby Rogues podcast. This week on our panel we have Avdi Grimm.

AVDI:  Hello from Tennessee.

CHUCK:  Coraline Ada Ehmke.

CORALINE:  Hello listeners.

CHUCK:  I’m Charles Max Wood from DevChat.tv. And we have a special guest this week, and that is Daniel Jacobson. Do you want to introduce yourself real quick, Daniel?

DANIEL:  Sure. I head up the Edge engineering team at Netflix, which is basically responsible for the API and playback as well as a range of tools and [insights teams]. And basically we’re the front door to the entire Netflix service.

CHUCK:  Cool. So, I’m not sure where we want to get started. There are a lot of different things that we could talk about here. Do you want to give us an overview of how Netflix looks at programming, looks at their development team?

DANIEL:  Yeah. So, it’s very hinged to our culture. We have our public culture slides. And the core premises of the slides are context and control, and freedom and responsibility. So, what we do is we set up a range of distributed engineering teams that specialize in certain areas. And each of those engineering teams have the freedom to make the choices and figure out what the problem set and the solutions would be for that problem space. And each team builds the systems based on what they want to do there. And they make the choices on which technologies to use.

There are a couple of key tenets on the technology front that drive some of this, though, that all teams need to at least figure out how to adhere to. One is our entire system is in AWS. So, we all operate in that sphere. And then another is we tend to operate in the JVM. So, very little if any Ruby in our environment. But within those constructs for the most part, each team is making the choices that they want to support their specialty. And we tend to focus very much on specialties. And that’s a very important concept.

So, a team is, this is tied to the microservices model. But that team who’s focusing on recommendations, they hire very deep specialists, senior level engineers across the board in each of those, like in the recommendations world as an example. And they go solve that problem and expose interfaces out to other teams so that they can consume the output of that. And then other teams such as my team will be the consumer of that. And we specialize in brokering data and ensuring scalability and resilience. And we consume a lot of data, ensure that we can be available to the devices. Other teams, like the metadata team specializes in creating a robust data set for all of our titles that we have. Others such as the rating system, very deep specialty around algorithms pertaining to ratings and so on. So, that’s at a high level how we operate and how we structure our teams here.

CORALINE:  With distributed teams, how do you make sure that you’re maintaining a consistent culture?

DANIEL:  So, the culture is actually pervasive throughout all of… everybody who joins the company will be screened very much for the culture. And we have a range of values that we screen for. And we need to make sure that that is very important on entry point. And then of course context from leadership needs to be very well distributed. So, we spend a lot of time on setting context on how we expect people to behave, how we expect them to build their systems and operate their systems. Of course, it’s up to them on how they’re going to actually execute it. But it’s a leadership task and I guess a recruiting and screening task to ensure that people are coming in with the expectation and operating with the expectation of adhering to the culture.

And going back to the freedom and responsibility of this, part of the context is go make the right choices but make sure that they’re informed choices, that you’re talking to the right people, that you’re communicating effectively, being sensitive to what other teams need. And then being responsible for those decisions that you’re making. And then of course it’s up to us to make sure that, ‘us’ being leadership, to course-correct as people are going. So, each team does have maybe its own little subculture but it’s all framed around the freedom and responsibility and context and control. It’s an interesting question. You could see an opportunity to degrade. But I feel like it actually lives and operates pretty well here.

CORALINE:  You said you hire senior people across all these teams. What does the onboarding process look like?

DANIEL:  You mean how do we get them up to speed?

CORALINE:  Yes.

DANIEL:  Well, team by team it’s going to be different. There are going to be different complexities. But I think by and large people are coming in with great knowledge in their space. So, not only are they really strong Java engineers, they also understand how to interact with other people. They’re mature. They understand how the build processes work and those kinds of things. So, the steps to get someone there is probably smaller than if you hire someone fresh out of college.

That said, each system has its own codebase. There are a lot of complexities at scale. A lot of people who come in had not dealt with the scale the size of Netflix. So, what we tend to do is we tend to start them out by figuring out how to build a dev environment on a box, get moving on going through the code. There’s obviously a lot of documentation, confluence pages and docs that they can consume along the way, ensuring that everyone on the team is open for questions, encouraging the new hire to ask a ton of questions and don’t be bashful about that. And within that framework, get building, start tinkering with the code, give them small tasks that they can actually execute on, eventually get things into production and then slowly broaden out the sphere.

So, we don’t have a bootcamp like Facebook does or like deep training processes that we put everyone through. It’s more, let’s get them in there. Let’s get them tinkering with the code and slowly start to expand their scope.

AVDI:  You wrote something really interesting about team culture. I think the default assumption we have when we look at a team is what its culture is, what its atmosphere is, is static. What it is now is what it’s going to be. But you wrote that teams are always either spiraling up or spiraling down. Could you explain that?

DANIEL:  Yeah. I think a lot of engineering managers can potentially fall into a trap of like you described, persisting in trying to solve use cases that they were set up for. And I think the issue there is that a lot of teams therefore figure that they understand what the staffing needs are going to be. And then they staff to those needs. And then they end up in this pattern of reacting to what the needs are from externally imposed forces, other teams, or consumers or whoever it might be.

So, the issue there is if you are staffed to your needs and you have a good sense as to how to operate and you’re reacting to the needs and you set up agile methodology for example and you do your burn down charts and all this other stuff that a lot of teams tend to do, you end up in this mode where the engineers are feeling uninspired, unchallenged. And ultimately engineers want to be impactful and they want to solve deep, meaningful challenges and they want to do innovative work, I think by and large or at least a lot of… that’s one of the things I screen for. I want people on my team to want to do those kinds of things.

So, if you end up in this reactive mode, it ends up being cancerous to the team. Because what ends up happening is people feel uninspired. They feel unchallenged. The work starts to degrade because the morale is a little bit lower. Over time…

AVDI:  To be clear, you’re talking about a mature project where it’s mostly bug fixes and little feature adds, right?

DANIEL:  Not necessarily. No, it doesn’t have to be.

AVDI:  Okay.

DANIEL:  I think that you can end up in this trap no matter where it is in the process. If it’s in maintenance mode, the cancer can hit there. If you’re just starting out and you don’t have the right mindset of innovation, it can happen at any stage, I think. And I think the issue is if you don’t give people the breathing room in order to really think beyond what is being asked of them and imagine other things, then you can’t actually do an innovation that will transform the way the team is operating. That’s the problem.

So, if you’re trying to solve a specific problem set and you’re just doing it based on a constrained set of variables, then you’ll solve the problem. But it’s not really transforming things. It’s incrementally improving things. And that incremental model, if you persist that for long enough, people will start to feel uninspired. They’ll want to move on. And if one person moves on then you have attrition that results in the same amount of work distributed across a fewer number of people, which translates into more reactive behavior.

AVDI:  So, you’re talking about the ability to step back?

DANIEL:  Yeah. You need the introspection. You need the room for people to think beyond what is being asked of them and just imagine maybe a greenfield state of what can really transform the way that the team is operating or the problems that they’re trying to solve. That’s where…

AVDI:  Do you have a concrete example of one of those transformations so that people can get an idea what you’re talking about?

DANIEL:  Yeah, absolutely. At Netflix when I first got here, one of the biggest issues that I had, my team was relatively small. It was staffed actually under our needs. And the first couple of months that I was here, our service was failing on a consistent basis. So, I was on production alert calls several times a week, spent many Sunday evenings, which is a time when a lot of people watch Netflix, on calls trying to recover systems. And for the first couple of months, it was just completely untenable.

So, basically I determined that there were two key things that the team needed to do. One was, I needed to staff up. And the staffing was not only to get to the point where I can just get things done. But I needed to get beyond that so I could do something transformative. Because the current state in which we were operating in was just completely untenable. So, we staffed up. I hired maybe five or so people initially. And then I basically said the first thing we need to focus on is solving our resiliency problem. Because the pattern that was happening was our service could die or falter, in which case people couldn’t stream anything. Or if other teams underneath us, the teams that we consumed from, if they were failing, we were faltering as well. And we weren’t protecting against that.

So, we decided we need to solve the resiliency problem first and foremost. So, what I did was I had more people that I could have execute on things. But I also pushed back on a lot of things and said, my team needs some breathing room to solve this problem. So, I deliberately tried to create that space so that they could think about what problem we can solve in a fundamental way. Not just how do we keep this system up but build a system that is sustainable and has longevity to it. And that’s when the team created Hystrix, which is a fault-tolerance [and latency] isolation system. Basically what that means is if any of the subsystems underneath us, if they falter, we’re basically going to pretend it’s dead and move on and do a fast fail and deliver some fallback state to the service. Whereas previously that would cause us to falter as well. And that was applicable to all systems that we consumed from.

And it was also a system that you could have other systems use. And so, it was a portable wrapper library that ended up being useful to many teams within Netflix and now is open sourced and other teams are using it outside of the company as well. But that was a hugely transformative step where you create this one innovation. Well now all of a sudden, we don’t have the burden of all the time being spent on trying to recover systems and trying to patch-fix small things that we were finding every time we would falter. Now we have a system that protects us from all that lost time.
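The fail-fast-with-fallback idea described above can be sketched roughly as follows. This is an illustration of the general pattern in Python, not Hystrix's actual Java API: the class name `CircuitBreaker`, its parameters (`max_failures`, `reset_after`), and the example services are all hypothetical.

```python
import time


class CircuitBreaker:
    """Hypothetical wrapper: once a dependency looks dead, skip it and serve a fallback."""

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0,
                 clock=time.monotonic):
        self.call = call                  # the remote call we depend on
        self.fallback = fallback          # cheap, local fallback state
        self.max_failures = max_failures  # consecutive failures before we "pretend it's dead"
        self.reset_after = reset_after    # seconds to wait before trying the dependency again
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: let calls through to the dependency again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def execute(self, *args):
        if self.is_open():
            # Fast fail: do not touch the faltering dependency at all.
            return self.fallback(*args)
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return self.fallback(*args)
        self.failures = 0
        return result


if __name__ == "__main__":
    def recommendations(user_id):
        # Stand-in for a remote service that is currently failing.
        raise ConnectionError("recommendation service is down")

    def popular_titles(user_id):
        # Fallback state: a generic list instead of personalized results.
        return ["popular-1", "popular-2"]

    breaker = CircuitBreaker(recommendations, popular_titles, max_failures=2)
    for _ in range(5):
        print(breaker.execute("user-42"))
```

The point of the wrapper is that the caller always gets an answer quickly: after a few failures the dependency is no longer called, so its problems stop propagating upward.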

And what that did is that enabled us to have even more time now to not only do the externally imposed requests from other teams but now we have more time to do more elective things. And once you have the time to do more elective things, you could find new innovations. And so, that’s when we moved away from the REST API to what we are now operating in, which is this scriptable Java API layer. We could talk about that if you want. But that model was another big innovation that really transformed the way that Netflix operated. And once you do that, you create more space for more innovations. And that’s the upward spiral part of it.

AVDI:  I’d love to talk about some of these technologies as we go on. But there’s one question I’m dying to ask about this idea of making space for innovation. And it’s this. In a lot of organizations I’ve been a part of, there’s been either a team or sometimes just one person, a lot of times just one person, who together or just the one person says, “I see this problem, this recurrent problem. And I want to fix it in a really general way.” And so, they go off into a corner and they start working on fixing it in a really general way. And I saw this. A long time ago there was a Rails project where somebody basically decided they were going to more or less rewrite Active Record in a way that would automatically batch all the queries that we were producing altogether into one big monster query ‘automagically’. That’s just an example. I’ve seen this over and over again.

And what happens a lot of times is that off in the corner project just keeps going and going and going. And at the weekly meetings they say, “So, is that thing done yet?” And they’re like, “No, I just have a few more kinks to work out.” And I think that’s the thing that’s at the back of a lot of project managers’ minds when they hear, “Oh, I’m going to step back and solve this whole problem in a really general way,” as they’ve seen stuff like that happen. Do you have any guidance for balancing solving these general problems against just perpetual churn like that?

DANIEL:  It’s a great question. I have two thoughts on it. First, in the early stages I actually don’t want to put any constraints. I think a deadline, it’s going to be a real limiting factor to the solution itself. What I mean by that is if you say, “Okay, well you have now three months to go solve this problem in a general way,” that duration automatically at that moment becomes a constraining factor on the possibilities and ensuring that we’re going to get the best ideas out of this implementation. So, in the early stages I don’t want any constraints like that. I want to give the breathing room and let them go and do it.

The other thing by the way is you can’t be off on an island just going and doing this. It’s got to be in concert with many teams. If this is going to solve a problem in a general way, it’s got to be talking to all of these other parties to make sure that you’re actually solving the problems. I see too many times centralized teams or architectural groups or whoever it might be solving what they believe to be problems but not doing it with proximity to the real use cases. And if you end up in that mode, once it’s “done”, it’s not really usable.

AVDI:  Mmhmm.

DANIEL:  So, you can’t just be off on an island. If you give the space to allow the person or the people to really imagine what this could be and not constrain them, I think that is a huge advantage and I’m willing to incur cost in the early stages and deal with that ambiguity, which I think a lot of people are not. So, you need to be comfortable with the ambiguity. You need to be comfortable with the idea of possible failure, which I think a lot of other people are not. I’m okay investing in some projects that maybe it will work, maybe it won’t. Maybe it will take three months, maybe it’ll take six. And betting that I have great people who I’ve assigned appropriately to the right things and that will execute on those things. So, it takes a little bit of a stomach there, which I have. And I encourage others to have or adopt.

But after some point, then I think you do need to start putting some constraints around it. And it’s not so much constraints like, “Okay, you said this delivery date.” But you need to start building expectations, allowing the product to have momentum, and external momentum. So, people need to start building expectations around it so that they can plan for it. And then you need to start trying to hit some deliverables. So, I think the incrementing deliverables would be the better route. But if that’s not possible, you still need to set expectations and hit some of those critical milestones.

So, give it some space. Eventually add some of those constraints and then start delivering on it and then incrementing on after that. We don’t have project managers here. We tend to use engineering management to marshal those things and maintain your relationships and set the constraints and deadlines. So, we don’t have that other project management perspective. So yeah, those are some of the key things that I do when we’re doing these kinds of things.

AVDI:  Thanks. That’s very helpful.

CORALINE:  How did you win support and space from managers in that Hystrix project? Do you have practical advice that other people could maybe use to make sure that they’re given space for innovation?

DANIEL:  Yeah. I think you need to be crisp and clear on what the benefit is and what the tradeoffs are going to be. So, for example in the Hystrix model the benefit is, “You know all those outages that we keep incurring? We don’t like those. So, how do we not do those? My team is going to invest in something that will hopefully systematically take care of some percentage of these in a meaningful way.” And the cost is going to be, “Well some of those tests that we’re trying to do for the product, we might have to defer some of that work. We’ll keep doing some of these. But these that are more elective for the company that are not more fundamental to what we’re trying to do, we’re going to push those off and free up that time.” So, that’s one aspect.

The other, it goes back to hiring beyond your needs. So, if I can staff my team beyond the point where all of those product tests need people, then I have bandwidth that I can distribute across all of these elective things. When we did Hystrix it was more about a little bit of horse trading. But now we’re at the state where we’ve done so many things and so many innovations. And we’ve staffed up effectively that we have the breathing room to do the things that are externally imposed as well as all of the elective things that we want to do.

And by the way, there’s another important component to this which is building trust and confidence that you can deliver on the things that you do. So, as you deliver on more of these things, then other people have more confidence that when you say, “Let’s do this tradeoff,” that that is the right tradeoff. And also when you say, “You know what? I’m going to add these three people to the team,” there’s confidence that those three people are going to deliver a value for the company. So, you got to build the trust over time as well.

CORALINE:  So, aside from the large-scale goals like the resiliency, what about when you’re in maintenance mode and small feature development? How do you ensure that there’s room for innovation there? Is it a matter of looser requirements? Or exactly how is that managed?

DANIEL:  Well, so one of the things that I say in the blogpost that you mentioned is there’s no such thing as maintenance mode, which is an extreme position. It’s not really meant to be truly literal. But I think if you have the mindset of this thing is not in maintenance mode then you can actually liberate yourself to think about how you can do this in a more sustainable way. Now, some products for example might be, “Okay we really don’t care about this entire service anymore. And the cost of moving it over is high and we can’t retire it just yet,” something like that could be maintenance mode. But for a team that is operating a suite of services, I think there’s a tendency, as barnacles continue to mount up on the service, to just let them sit over here and we’ll support them as they go.

And for those kinds of situations, I believe that no actually, you shouldn’t just let them coast over here and be this isolated island that eventually will introduce problems through drift. What you should do is actually spend the time and energy to imagine how that could fit into the future system. So, in those kinds of modes it wouldn’t be like, “Let that thing just continue to sit in maintenance mode,” or, “Let’s just increment features over here a little bit.” It’d be, “How can we reimagine a world where this is just fundamentally better?” and move those things into that future. And again, that goes with a little bit of space, pattern recognition on seeing what kinds of features are continually cropping up, or what kinds of support issues are continually mounting on this system. And then trying to reconcile that pattern rather than handling them on a one-off basis.

CHUCK:  So, if we’re done talking about that line of discussion I have another topic that I want to bring up. So, what I’m looking at, you wrote a blogpost that basically said ‘Why REST Keeps Me up at Night’.

DANIEL:  [Chuckles] Yeah.

CHUCK:  Do you want to just give us a brief outline of that? And then we can talk about that. Because I know that a lot of people have adopted REST. They’re super excited about it. Of course, when I talk to my clients they say, “We want a REST API,” and I say, “What kind of REST API?”

DANIEL:  [Chuckles]

CHUCK:  So yeah, do you want to go into that a little bit?

DANIEL:  Yeah. So, as you just [alluded] to, REST means potentially different things to different folks. The general premise of this article is it hinges on this other idea that I’ve been espousing in the API space which is, don’t adhere needlessly to standards or generally accepted approaches. Solve problems that are beneficial to the company and pick the right technology or standard or whatever it is to solve that.

And so, for Netflix specifically we used to have a REST API. It was public, and when I first started at Netflix it was also becoming the basis for a lot of our device implementations. But it ended up being a problem for Netflix as a whole as we expanded our device ecosystem. So, as it is today we have over a thousand different devices. We have dozens and dozens of UIs and lots of engineering teams that are consuming from the APIs that we produce. And what I was seeing maybe three or so years ago, three and a half years ago, was that as we had more and more of these teams consuming from the API, the needs were diverging more and more, and the number of requests that were coming in was impairing our ability to deliver on those needs.

So, what I mean is if you have the PS3 being developed over here they want specific things to make the PS3 experience optimal. And so, they’re trying to get data from the API through certain request/response models. And then meanwhile the iPhone team or the Apple TV team is looking for something fundamentally different. And if we have a one-size-fits-all, that’s really the critique here, is if you’re trying to build a one-size-fits-all model which tends to be a REST-based implementation, trying to adhere to all of these requests becomes a problem because some of these requests conflict with each other.

And the more of these teams that are trying to consume from the API, the harder it is for one centralized team to handle all these requests. So, you end up deferring requests or taking shortcuts or ending up with spaghetti code and all kinds of complexity. And so, it really created a problem for my team. And that’s when we decided, this is after Hystrix and we had more space and more staff, to take on a larger initiative which we called internally .next. It was the next implementation of the API which completely diverged away from traditional REST one-size-fits-all models. And instead we basically empowered the consuming teams from the API to build their own APIs, which tend to be very ephemeral in nature.

So, what I mean by that is we built an ecosystem where we supplied in our JVM a Java API that was very granular in nature. Element-level requests could be made on a method to get a specific piece of data or set of data. And on top of the API would live these Groovy scripts that each team would own and operate. And they would deploy those completely distinctly inside the same JVM from the WAR file that we would push for our Java APIs. As their UIs would iterate, they would create a Groovy script and an endpoint dynamically that would deliver the document that they would want across to the device. And as they deployed the device code they would deploy their corresponding Groovy script. And that would automatically compile into our JVM. And those Groovy scripts would live for possibly a day, maybe a week, maybe a month. But they would ultimately be replaced, and a new endpoint and a new Groovy script would be created when they did their next revision on the client code.

So, there are two benefits there. One is the Groovy scripts enable the UI teams to customize their interaction model per device or per request basis. And the other thing is they can iterate those independently of the work that we’re doing. So, we’re no longer in the critical path for changing the APIs as long as the data exists in the system today. If the data needs to be injected in, then we have to be involved in that. But it really created this empowerment model where the teams that are closest to the actual needs of the data could control the flow of the data.

One of the other posts that I created or that I wrote for The Next Web was around separation of concerns. And this was a key point in our model as well, where if you look at the three core functions of an API, they’re basically: someone needs to gather data, someone needs to format data, and someone needs to deliver data. The API team cares a lot about how the data’s gathered. But the consuming teams don’t care that much. They just want to make sure that it is gathered. And in traditional one-size-fits-all models, the formatting and delivery, those are typically also done by the API team. But in fact, those are the places where the UI teams care a lot about it. And the API team doesn’t really care that much as long as it’s supportable.

So, what we did is we decided to break down based on separation of concerns where the API team focuses on the gathering of the data and ensuring that the data can be gathered correctly. And if not correctly, then some fallback state can be provided. And the UI teams can focus on the formatting and delivery that they care about. And we just create a platform that enables that to happen. So, that’s the premise of this ‘Why REST Keeps Me up at Night’ blogpost. And that’s why we moved away from this one-size-fits-all REST model.
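The gather / format / deliver split described above might look roughly like this. It is a sketch in Python rather than the Java-plus-Groovy stack discussed in the episode, and every name in it (`CATALOG`, `get_title_field`, `endpoint`, the endpoint paths) is hypothetical. The "API team" half exposes granular getters with a fallback; the "UI team" half registers small per-device scripts that decide the shape of the response.

```python
import json

# --- Owned by the API team: gathering data, with a fallback when it is missing ---

CATALOG = {
    "t1": {"title": "Title One", "rating": 4.5, "art": "t1.jpg"},
    "t2": {"title": "Title Two", "rating": 3.8, "art": "t2.jpg"},
}


def get_title_field(title_id, field, default=None):
    """Element-level getter: one field of one title, or the given fallback."""
    return CATALOG.get(title_id, {}).get(field, default)


# --- Owned by each UI team: formatting and delivery, one script per endpoint ---

ENDPOINTS = {}


def endpoint(name):
    """Register a device-specific script under an endpoint name."""
    def register(script):
        ENDPOINTS[name] = script  # a newer script can replace an older one
        return script
    return register


@endpoint("tv/home/v7")
def tv_home(title_ids):
    # The TV UI wants artwork and titles as a JSON document.
    return json.dumps([
        {"title": get_title_field(t, "title", "Unknown"),
         "art": get_title_field(t, "art")}
        for t in title_ids
    ])


@endpoint("phone/home/v12")
def phone_home(title_ids):
    # The phone UI only wants titles, one per line.
    return "\n".join(get_title_field(t, "title", "Unknown") for t in title_ids)


def handle(name, *args):
    """Dispatch an incoming request to whichever script owns that endpoint."""
    return ENDPOINTS[name](*args)


if __name__ == "__main__":
    print(handle("tv/home/v7", ["t1", "t2"]))
    print(handle("phone/home/v12", ["t1", "missing-id"]))
```

Both scripts read from the same getters, so a UI team can add or change an endpoint without any change to the gathering code, as long as the data it needs already exists.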

CHUCK:  So, I’m curious then. You kind of talked around it a little bit, but what did you move to?

DANIEL:  It’s something we created.

[Chuckles]

DANIEL:  It’s that Java API with the Groovy scripting layer in it.

CHUCK:  Right.

DANIEL:  And it’s just something distinct. I don’t know.

CHUCK:  So, there’s not a good way to describe or something that it’s like something else that we understand?

DANIEL:  Not really. It’s just a distinct model. This goes back to the idea of solving real problems and not adhering to just whatever standards are out there. So, the easy path is, “Well, everyone’s doing REST. We have this REST API. We should just continue to iterate on that.” And it was fine for solving our problems. But it wasn’t fundamentally addressing core issues that we were seeing as we scaled up. And so, how do we just solve the problem in front of us? Let’s do this. So, we spent a lot of time internally just building, figuring out what this needs to do and then building that system and not worrying about what is good in the marketplace, what are people doing in the industry. So, as far as I know this is a completely distinct model. I haven’t heard anybody doing anything like this.

CORALINE:  Is the Groovy layer essentially composing data from the API and presenting it in a specialized way?

DANIEL:  So, each script, which is written by the UI teams, can call out to the Java API to get the data elements that they want. And then they can compose it however they want. And then they can deliver it however they want. So, it’s basically a model where they can write Java code. But in fact, it’s Groovy code, and tap into anything that’s available in the JVM in order to get what they need. And we put some constraints on that to make sure that they don’t do really evil things to the system. But yeah, what they’re doing is they’re calling into the Java APIs, getting whatever they need there, processing it, handling errors, doing whatever they need, composing a data set, and then sending it across the wire somehow.
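The split Daniel describes (the API team gathers data, each UI team formats and delivers it) can be sketched roughly as follows. This is an illustrative Python sketch, not Netflix's Java/Groovy code: `ServiceLayer`, `tv_script`, and `phone_script` are hypothetical names, and the in-memory catalog stands in for the real backend services.

```python
from typing import Dict, List


class ServiceLayer:
    """Hypothetical stand-in for the shared data-gathering API owned by the API team."""

    def __init__(self, catalog: Dict[int, Dict[str, str]]) -> None:
        self._catalog = catalog

    def get_titles(self, ids: List[int]) -> List[Dict[str, str]]:
        # Gather: return what is found; unknown ids get a fallback record
        # so callers always receive something usable.
        fallback = {"name": "Unavailable"}
        return [self._catalog.get(title_id, fallback) for title_id in ids]


# Device-specific "scripts" owned by the UI teams (hypothetical examples).
# Each one calls the same gathering API but formats the result its own way.
def tv_script(api: ServiceLayer, ids: List[int]) -> str:
    # A TV client might want one compact, upper-cased string.
    return " | ".join(t["name"].upper() for t in api.get_titles(ids))


def phone_script(api: ServiceLayer, ids: List[int]) -> List[str]:
    # A phone client might want a list of names truncated to fit a small screen.
    return [t["name"][:10] for t in api.get_titles(ids)]
```

The point of the design is that adding a new device means adding a new script, while the gathering code and its fallback behavior stay in one place.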

AVDI:  You touched on Hystrix and dealing with services that go down and up and fall behind a little bit. I’d love to hear a little bit more about that. And particularly, I’m curious if there are patterns that people should understand when they’re dealing with making systems resilient.

DANIEL:  One thing that is a certainty is that in a complex distributed environment, things are going to fail. And so, you have to take that for granted and always prepare for things falling out from under you. Having systems that are resilient against that is a fundamental aspect of what we try to do here. And we incorporate things like Chaos Monkey and assume I need to try and inject failure to figure out where those failures are going to happen in a controlled way.

What Hystrix is doing is this: an incoming request from a device hits the API layer first at the Groovy point. Groovy is going to say, “I need this data to satisfy this request.” Then it calls into the Java APIs to get it. And then for every one incoming request from a device, we make on average seven or so outbound calls to dependent services. So, these are remote calls to other teams’ systems. And we need to fetch data from different boxes here in order to compile the data that is needed for that one request. So, for each one of those outbound calls any one of them could fail. And it can fail for a variety of reasons. And the failure could be an isolated failure or it could be a fundamental failure.

And what I mean by that is there could be some sort of weird latent issue that’s happening on this specific call. That’s more isolated, in which case we might do a retry or just deliver a fallback state for that one. But a more fundamental failure is let’s say the service is down or it’s just being overloaded so it’s timing out. All requests are timing out because it’s just overly latent. In each of those models, Hystrix will treat them differently. So, in the isolated one we’ll try our best and then we’ll keep moving on. But if enough of those calls are failing, Hystrix will automatically detect that error rates are high or the overall error rate, or we’re exceeding some threshold on latencies or we’re queueing up and things are backing up on our server as we’re making calls outbound.

Whatever it might be, Hystrix will see an accumulation of issues. And once it sees the accumulation of issues it’ll automatically flip the circuit. So, this is a circuit breaker technology. So, the circuit will flip and basically say, “We’re not going to call this service anymore because we’ve accumulated enough of these kinds of error types and exceeded some threshold.” And so, each service by the way could have a different threshold. It could be 10% or 20% or 50%. But if the ratio of successes to other issues crosses that threshold, we say, “We’re going to pretend this service is dead for all future requests until we establish that this service is healthy again.” And in the interim while it’s not healthy, we’re going to do something else. And preferably that is to provide some other fallback data. But if we can’t do that, then we’ll just say, “Well, that service is dead. Let’s fail quickly instead of building up a queue of requests.”

Because the pattern that you can see there is if you don’t allow for this fast failure or delivering of some other fallback state, the queueing up of requests as some other service underneath becomes latent ultimately drains resources on your system. And as it drains resources on your system, eventually you will tip as well. And that’s something that my team cannot handle because if our service tips, then nobody streams anything. So, Hystrix after we get to that state where we flip the circuit, periodically we’ll call back to the system to see if it’s healthy again. And once it’s healthy again we’ll flip the circuit back. And then a normal request will flow again. And hopefully the lack of requests going back to that service alleviates some of the strain on them which gives them an opportunity to recover from whatever they’re dealing with.
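The circuit-flipping behavior described here can be sketched in a few lines. This is a minimal Python illustration of the general circuit breaker pattern, not Hystrix's implementation or API: the class name, the parameters (`error_threshold`, `min_calls`, `cooldown_seconds`), and the simple cumulative error count are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Trips when the error rate crosses a threshold, fails fast while open,
    and lets a probe call through after a cooldown to check for recovery."""

    def __init__(self, error_threshold: float = 0.5, min_calls: int = 4,
                 cooldown_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold    # e.g. 0.1, 0.2, or 0.5 per service
        self.min_calls = min_calls                # do not trip on too small a sample
        self.cooldown_seconds = cooldown_seconds  # wait before probing again
        self._clock = clock
        self._calls = 0
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, remote: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                return fallback()  # fail fast: the remote service is not called
            # Cooldown elapsed: let one probe through to see if the service recovered.
            try:
                result = remote()
            except Exception:
                self._opened_at = self._clock()  # still unhealthy; restart the cooldown
                return fallback()
            self._calls = self._failures = 0     # healthy again: close the circuit
            self._opened_at = None
            return result

        self._calls += 1
        try:
            return remote()
        except Exception:
            self._failures += 1
            error_rate = self._failures / self._calls
            if self._calls >= self.min_calls and error_rate >= self.error_threshold:
                self._opened_at = self._clock()  # flip the circuit
            return fallback()
```

While the circuit is open, callers get the fallback immediately instead of queueing behind a slow dependency, which is what keeps the calling service from draining its own resources.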

An example of this would be let’s say we are trying to build the recommendations for Netflix. And so, you go to the home screen. There’s a whole slew of titles. There’s a whole bunch of rows containing those titles. Those are all personalized for you. And when the call comes in from a device, we call out to the recommendations service underneath us. If they are backed up, we might try again. But if the circuit is flipped or if an isolated occurrence does not reconcile, then we will go to a fallback state, which basically means instead of delivering personalized data we’ll call some other cache that has just a generic set of popular titles. And so, we’re not dependent on that service anymore. We’re calling something else in order to get this generic data. And we’ll deliver that because that’s better than delivering nothing.
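The recommendations example (retry once, then serve generic popular titles) might look something like this. It is a hedged Python sketch: `get_home_rows`, `fetch_personalized`, and `fetch_popular` are hypothetical names, and the single retry is an assumption based on "we might try again."

```python
from typing import Callable, List


def get_home_rows(fetch_personalized: Callable[[str], List[str]],
                  fetch_popular: Callable[[], List[str]],
                  user_id: str, retries: int = 1) -> List[str]:
    """Try the personalized service, retrying a limited number of times,
    and fall back to a generic popular-titles source if it keeps failing."""
    for _ in range(retries + 1):
        try:
            return fetch_personalized(user_id)
        except Exception:
            continue  # isolated failure: try again if attempts remain
    # Personalized data is unavailable; generic data is better than nothing.
    return fetch_popular()
```

Note that the fallback source is independent of the failing service, so the home screen still renders even when recommendations are completely down.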

So, that’s basically how Hystrix operates. And it’s really designed to protect our service from any of these queue up or error situations on all these dependent services that ultimately can back up and affect our service, which would be a real problem for the system.

AVDI:  One of the things I love about that architecture is that it seems like you’re treating requests as objects or as processes in their own right rather than trying to treat them as method calls.

DANIEL:  What do you mean by that? I’m not sure I understand.

AVDI:  Well, I was looking a little bit at the documentation of Hystrix and it seemed like you were bundling up requests as objects in their own right. There’s a tendency in APIs and in making requests to treat it as method call, to treat it as a transactional thing. But it seems like you’re bundling up requests as their own entities. And even from my understanding of it, and correct me if I’m wrong but it looks like you’re actually giving them their own thread to run in, at least in some modes.

DANIEL:  Yeah. I think you’re exactly right. Basically what we’re doing is we’re saying there’s actually a distinction between incoming requests that we handle and the outbound requests that we handle. So, the incoming request, that hits the Groovy layer and that is this method call but it’s isolated from the rest of the API ecosystem. But once it enters into the Java API world then we treat it very much like a set of events that we need to process. So, we use RxJava and the reactive process to try and distribute these events to get the data. They’re basically isolated events from each other. And any one of them could potentially fail and not necessarily adversely affect what we deliver to the Groovy so that the Groovy can then process it. So yeah, it’s more about event isolation.
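One way to picture this isolation is to run each outbound call as its own unit of work, so that a failure in one cannot take down the others. The sketch below uses a Python thread pool purely for illustration; it is not how Hystrix or RxJava is implemented, and `gather_isolated`, its parameters, and the per-call fallback map are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def gather_isolated(calls: Dict[str, Callable[[], Any]],
                    fallbacks: Dict[str, Any],
                    timeout_seconds: float = 1.0) -> Dict[str, Any]:
    """Run each outbound call on its own worker thread and collect the results.
    A call that raises or times out is replaced by its fallback value
    (None if no fallback was given); the other calls are unaffected."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(calls)))
    try:
        futures = {name: pool.submit(fn) for name, fn in calls.items()}
        results: Dict[str, Any] = {}
        for name, future in futures.items():
            try:
                # The timeout bounds how long we wait for this result.
                results[name] = future.result(timeout=timeout_seconds)
            except Exception:
                results[name] = fallbacks.get(name)
        return results
    finally:
        # Do not block on stragglers. A timed-out worker keeps running in the
        # background; a production system would also need a way to cancel it.
        pool.shutdown(wait=False)
```

For one incoming request that fans out to several dependent services, the caller always gets a complete dictionary back, with fallback values in place of whatever failed.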

CORALINE:  Did you write that architecture incrementally or was that something that was planned upfront?

DANIEL:  So, Hystrix was there… that was one of the key things that we built initially. On top of that, that’s when we started building what we call the .next architecture which included a range of things. That included the core Java APIs which included a service layer which is basically a set of method calls or object calls that the Groovy layer can call into. We supported the Groovy as well as a system that allows them to upload the Groovy dynamically and compile it into the JVM. We built that as well. RxJava came out of this. I think that was on the tail-end of some of the optimizations we were doing around that. Yeah, so it was all part of the core architecture as we were building it out. Some things emerged as we went. But yeah, it was all tied together.

CORALINE:  You mentioned RxJava a couple of times. Could you explain what that is?

DANIEL:  Yeah, so Reactive Extensions, RxJava… or Rx is Reactive Extensions, which was created by Erik Meijer at Microsoft. It was primarily a .NET implementation. It’s really about event processing and stream processing of data. So basically, you open a pipe and allow the data to flow through. And it’s different than request-based and thread-based handling. So, what we’ve developed was basically a Java implementation to allow for that streaming of events to happen. And then we open sourced it. So, RxJava is up on our GitHub repo as well. We use it for a range of things including our interactions with the backend services. So, we can open up streams to go get the data. And it just handles it as events and allows it to flow through.

We’ve been experimenting with an implementation of that with Netty as well as using it for stream processing around insights so we can get real-time analytics. When you deal with the massive amount of traffic that we deal with and a massive amount of data, it’s hard to get that data for logging information and understanding the operational health of the system. It’s hard to get all that data and get a real-time perspective on it, especially in the cloud where systems are coming and going and there’s ephemerality to the systems as well. So, we’re using Rx to basically open up streams of data and allow it to flow through. And we can observe the data as it’s flowing and then create insights around that.
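The "open a pipe and let the data flow through" idea can be illustrated with a very small push-based stream. This Python sketch is not the RxJava API; the `Stream` class and its `subscribe`, `emit`, `map`, and `filter` methods are a hypothetical, heavily simplified stand-in for an observable, with no error or completion signals.

```python
from typing import Any, Callable, List


class Stream:
    """A tiny push-based stream: subscribers receive each event as it is emitted."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Any], None]] = []

    def subscribe(self, on_next: Callable[[Any], None]) -> None:
        self._subscribers.append(on_next)

    def emit(self, value: Any) -> None:
        # Push the event to every subscriber as it arrives.
        for on_next in list(self._subscribers):
            on_next(value)

    def map(self, fn: Callable[[Any], Any]) -> "Stream":
        # Derived stream that emits fn(value) for each upstream event.
        out = Stream()
        self.subscribe(lambda value: out.emit(fn(value)))
        return out

    def filter(self, predicate: Callable[[Any], bool]) -> "Stream":
        # Derived stream that passes through only events matching the predicate.
        out = Stream()

        def forward(value: Any) -> None:
            if predicate(value):
                out.emit(value)

        self.subscribe(forward)
        return out
```

As a usage example in the spirit of the real-time insights use case, a stream of request latencies could be filtered down to the slow ones and observed as they flow through: `latencies.filter(lambda ms: ms > 100).subscribe(alerts.append)`, where `latencies` and `alerts` are placeholder names.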

CORALINE:  I should point out that there is a Ruby implementation of Reactive Extensions that’s currently being rewritten.

DANIEL:  Wow. Who’s writing it?

CORALINE:  A friend of mine, actually. He’s working with the team at Microsoft to do the Ruby implementation of that.

DANIEL:  Awesome.

AVDI:  I’m glad to hear it’s being rewritten.

CORALINE:  I understand there were some problems with the original implementation.

AVDI:  I tried to get the original implementation working. But yeah, emphasis on tried.

[Laughter]

DANIEL:  Yeah.

CORALINE:  I have a question about team dynamics if we’re done with technical questions from there.

CHUCK:  Go for it.

DANIEL:  Sure.

CORALINE:  Did I understand correctly that you only hire senior engineers and you do not hire junior and mid-level engineers at all?

DANIEL:  For the most part that’s true, yep. Company-wide.

CORALINE:  Do you find that with senior engineers, egos tend to get in the way? Or is that something that’s easily managed?

DANIEL:  You know, that’s a great question. It’s a tough dynamic. I consider my team to be a very, very strong senior engineering team. And sometimes we end up with conflicts where they disagree or one person is encroaching on the other person’s work. It’s important to have strong management too, so senior-level management who can help navigate those situations. And sometimes it happens. There’s just no avoiding it. It’s just the nature of having people who are very opinionated, seasoned and experienced. But if you’re creating the right context, sending the right people on the right projects with the right context, you can most often navigate that. And I think people are also, there’s a maturity that comes with that seniority as well. And so, they understand more the personal dynamics and the importance of that as well. So, it’s always a challenge from a managerial perspective. But I think it’s navigable.

CORALINE:  Are those interpersonal skills something that you specifically screen for in the interview process?

DANIEL:  Definitely. We definitely try. So, we have in our culture slides a range of values that we aspire to have and we screen for. And that’s very important to me. And some of them are honesty and courage and communication as well as passion and curiosity. These are important because if people can be honest and have the courage to say the tough things and challenge the right issues and they are good at communication, those work in concert with each other. And we definitely spend time asking people to tell us about these kinds of things, to give us experiences and examples of how they operate in these kinds of situations, to get at those values and make sure that when they come here, they are a culture fit.

And a lot of times I’ll get questions about whether the culture slides are accurately reflecting the way that we operate or if it just sounds too good. But the reality is this is core to who we are. It’s important that we are honest about our culture and that we can screen for it and be consistent on it. And if we’re not honest about it it’s just going to result in people coming in who do not meet those cultural needs. So, we spend a lot of time in the interview process, in the recruiting process to make sure that we get the right people.

CHUCK:  Alright. Anything else we should go over before we wrap up the show?

CORALINE:  I was thinking someone should talk about Conway’s Law and how that applies to Netflix but I don’t really have the question phrased.

DANIEL:  [Chuckles]

CHUCK:  Well, Conway’s Law is what? The communication of the application mirrors the communication of the organization?

DANIEL:  Yep.

CORALINE:  Exactly.

CHUCK:  So yeah, you’re talking about all these services. And it sounds like there’s a lot of chatter going back and forth, right? You said one call to the API means seven or so calls back to other systems. So, how exactly does that work? Do you have a lot of teams that all mirror that? Or how does that all work out?

DANIEL:  Yeah, so I described earlier that we have a bunch of specialized microservices who are focusing on solving a specific problem. And the way it actually translates is that it’s not just that they make themselves available to the API and then we handle it. There’s actually a lot of communication across those services as well. Each team is responsible for talking to the consuming teams and the teams that they consume from to make sure that any interaction point between the computers has an interface that can satisfy all the consumers and dependencies in that exchange between the computers.

So, if you take one service and say that another service needs to consume from it, the two computers need to talk to each other. But in order to get the computers talking to each other the people need to talk about what that interface looks like. But any service within the company has a growing number of those teams. So, it’s not a one-to-one in computers or in people. It becomes a many-to-many. So, many teams are talking to many other teams about how their computers need to talk to each other. And what that translates into is basically every team needs to open up to every other team how they operate. And every other team needs to be available to the possibility of any other team consuming from it. What that translates to then is that any other team could potentially impose requirements on any given team.

And Conway’s Law comes into play where all these communications across these people are going to start influencing how we create our interfaces to make sure that they can handle the needs of all of these other teams. But as that continues to grow and grow the number of needs is going to grow. And therefore the number of human communication points will continue to grow as well. And as the people continue to talk, the concern that I see around Conway’s Law is that the needs of all the many different teams that need to consume from this service will impose themselves, potentially adversely, on the architecture itself.

So, the system was built to solve this specialty. We exposed all these interfaces to reveal that data of that specialty. But now it’s started to degrade because the breadth of use cases is going to continue to expand through all these communications that we’re continually having with all these other teams. So eventually, the architecture’s going to fracture, which again gets back to this need of having staffing beyond your needs and creating innovation where you can actually say, “Alright, we need to refactor this in a more fundamental way because the pattern is no longer working.”

And in some cases I think that there’s a benefit to creating isolation pockets across these teams where you can sever the potential of other teams needing your service, which is hard. It’s hard to identify those lines. But if you can, then you can simplify the communication structure, the human communication structure, which will then allow for simpler architecture. So, it’s an interesting dynamic. It’s hard to navigate for sure, because you want to be open. You want to support the business’s needs. But sometimes by being overly supportive it creates more problems than it helps. Does that make sense?

CHUCK:  Mmhmm.

CORALINE:  Definitely.

CHUCK:  So, I’m just curious. What are the best parts and the hardest parts of working for Netflix? From just a team or a technology standpoint.

DANIEL:  The best part is definitely the talent level here, which is fantastic. All senior-level folks, very experienced. I’ve never worked with as many talented engineers. So, that’s amazing. And dealing with the scale, billions and billions of transactions a day, tons of data that we’re dealing with, it’s really exciting and interesting. And part of that comes with the fact that not many companies have to deal with this level of scale. And so, it forces us into the situation of trying to find new ways to solve problems that maybe a lot of companies don’t really have to deal with. So, there’s an exciting aspect to that as well.

Some of the challenges are that when I first started at Netflix the company was five or six hundred people. We’re 1800 or so now. So, one of the challenges is as we scale, maintaining the culture, making sure that we don’t create overly complex systems, dealing with some of the Conway’s Law stuff that I was talking about before. That can introduce some frustrations at times. But by and large I think that the culture here and the way that we operate and the challenges that we have are just really exciting.

CHUCK:  It’s funny because it sounds a lot like my experience being on development teams where the best part of the job is the people and the worst part of the job is the people.

[Laughter]

DANIEL:  Right. Yup.

CORALINE:  Things would be so much simpler without people involved.

DANIEL:  [Chuckles]

CHUCK:  Yeah, but I don’t know if they’d be as rewarding either.

DANIEL:  You know, it’s interesting. So, earlier in my career I had a development and engineering start. There was an excitement that I got out of interacting with a computer. There was predictability, right? So, develop things, something would happen. It was causal. And if I didn’t know what was happening it was my fault, not the computer’s, right? The computer was always right. But as I got more into management I found that it was a lot more exciting to deal with the people stuff because it’s just completely unpredictable. [Chuckles] There’s always change. And it just always felt like a different kind of challenge to me. So, I shifted my thought on that.

CHUCK:  Awesome. Well, let’s go ahead and do some picks. Coraline, do you want to start us off with picks?

CORALINE:  Sure. I have one pick today. It’s called Fund Club. It’s a joint effort between AlterConf and Model View Media. Basically once you’ve signed up you get a monthly email that highlights a specific project or initiative or event or organization that focuses on diverse communities in tech. You’re then encouraged to donate $100 to that month’s selection. Fund Club doesn’t actually handle the money. You submit directly to the recipient project. Then you confirm with Fund Club that you donated. In the next month you get a new email describing another project. So, they are just getting off the ground now. They just announced a few weeks ago. And they’re currently taking submissions for projects to fund. And I’ll put a link to the site in the show notes.

CHUCK:  That sounds cool. Avdi, what are your picks?

AVDI:  I have two picks today. First one is a site that I just found out about and it’s apparently been around for a while though, called The Codeless Code. And it is a collection of koans I guess you would say about programming. And they’re not koans in the sense of Ruby Koans of problems to solve. They’re more literally little Zen-style stories all by one author I believe. And they are simply amazing. They have that wonderful property of bringing out some concept using a metaphor or using an example. The few that I’ve managed to read so far, I just found out about this today, the few I’ve managed to read so far have been terrific. And I’m planning on reading through this whole site. So, The Codeless Code, that’s at TheCodelessCode.com.

And the other pick is not programming related. But it is kind of topical. I just found out today that one of my favorite children’s shows is back on Netflix, Trotro. It’s this adorable little cartoon donkey that’s a great show for babies and toddlers. And it’s just one of those… I love the shows that are gentle and quiet and aren’t all flashing lights and loud sounds. And one of our kids particularly used to love it. And so, it was tragic when it went off. So Daniel, I know that you personally brought that back for us and I thank you for that.

[Laughter]

DANIEL:  I did some research on you. I knew you would want it so I got it for you.

AVDI:  The timing was perfect. She’s up there and enthralled and very happy.

CHUCK:  Do you get that often? “You work at Netflix. Can you maybe…?”

DANIEL:  Yeah, a fair amount. I get a lot of tech support requests. And I also get, “Hey, can you put me in touch with someone who can review my script?”

[Laughter]

CHUCK:  That’s awesome. And by awesome I mean must be frustrating.

DANIEL:  [Chuckles] Yeah.

CHUCK:  [Chuckles]

DANIEL:  Awesome for them, not so much for me.

CHUCK:  Yeah. So, I’m going to make a few picks. The first one is, I may have picked this on the show before. I can never remember anymore. But there’s a podcast out there called Serial. And it’s serial like serial killer not cereal like breakfast cereal. And basically what it is, is it’s a 12-part series about a young man who was convicted of murder, murdering his ex-girlfriend. And so, it talks through all of the things. Well, what about this? And what about that? And they go and they find all the people that testified and might have known something. And [chuckles] it’s just really fascinating. So, if you want to go listen to it, all 12 episodes are out. Apparently they’re picking up another case or another series or something for another season. But anyway, the first season is awesome. So yeah, I want to pick that.

The other thing I want to pick. This coming Sunday, when this comes out it’ll be this past Sunday, is Father’s Day. And so, I just want to pick all the awesome dads out there. I know people sometimes have issues with their dads. But I was lucky. I have an awesome dad. I love my dad. And so, I’m just going to shout out to him and pick all of the great fathers that are out there that are doing great things.

Finally I also want to do one more semi announcement I guess. And that is that I’m going to be putting the RailsClips videos up this week, just a couple of them. And then they’re going to trickle out as a series since that’s what I promised. I know I’m a little behind on delivery. But they’re going to be out there. So, if you backed the campaign, thank you. And if you didn’t then you can go sign up at RailsClips.com. And there was also a stretch goal. So, keep your eyes peeled because I will probably put something out there so that people know where to go to get the webinar that I promised I would put out there about testing. So, I’m going to make it available to everybody. I’ll probably have some nominal fee to it and then the backers get in for free. The reason I do fees is mainly because if you sign up I want you to show up. And so, the money is a way for me to say, “Hey, don’t waste your however much.” I don’t think it’ll be a lot, but it’ll be enough to hopefully get people to come if they’re really interested.

So anyway, those are my picks. Daniel, what are your picks?

DANIEL:  I have a few. By the way Serial is an awesome podcast so I’m behind you on that one. I have another podcast that I wanted to recommend. This one’s by a company called Gimlet Media, which is also an offshoot from This American Life, as Serial is, where Alex Blumberg created this company called Gimlet Media. The first podcast they created was StartUp. And it really is him chronicling his process of creating his startup of Gimlet Media which includes discussions around, how did they come up with the name? How did they get funding? How are they pitching investors? And it’s really open and exposed. It’s really excellent.

Since then they’ve also opened up two other podcasts, one of which is called Reply All. And Reply All is a podcast about the internet where they dig in on a range of historical and current things that are going on in the internet. So, they interview for example the guy who invented the pop-up ad, and JenniCam if you remember her from back in the day. It’s really interesting. And they just started a third one called Mystery Show, which is a woman who goes around trying to solve personal mysteries that you can’t discover or learn about or solve on the internet. So, those are all really good. Totally into the Gimlet Media stuff so far.

My second pick is a blog post by Chris Messina on his Medium site. It’s called ‘Seeking Genius in Negative Space’. And I think it’s apropos to some of the things that we were talking about earlier. It’s a blogpost going into the need for not just focusing on what’s in front of you but what is not in front of you and trying to see the possibilities with that negative space. It’s a really excellent post. And Chris writes a range of other things on the site. Another post that’s there is called ‘Full Stack Employee’ which is another good one. There is some personal stuff in there. There’s a lot of musings around technology and things like that. And it’s just generally a good site.

And the third one is I’ll just plug the Netflix Techblog and our GitHub repo. So, if people want to see some of the stuff that we’re doing, I’ll share those links as well.

CHUCK:  Alright. I was actually going to ask you for that, if there are places that people want to go to learn more, what should they do?

DANIEL:  Yep, there they are. We’ll get them on the site, right?

CHUCK:  Yep. Alright, well I don’t think we have anything else. So, we will go ahead and wrap up the show. Thank you all for listening. And we will catch you all next week.

[This episode is sponsored by MadGlory. You’ve been building software for a long time and sometimes it gets a little overwhelming. Work piles up, hiring sucks, and it’s hard to get projects out the door. Check out MadGlory. They’re a small shop with experience shipping big products. They’re smart, dedicated, will augment your team and work as hard as you do. Find them online at MadGlory.com or on Twitter at MadGlory.]

[Hosting and bandwidth provided by the Blue Box Group. Check them out at Bluebox.net.]

[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]

[Would you like to join a conversation with the Rogues and their guests? Want to support the show? We have a forum that allows you to join the conversation and support the show at the same time. You can sign up at RubyRogues.com/Parley.]
