CHUCK: If someone tells you to get with the Times, does that mean that they’re hiring?
[This episode is sponsored by Rackspace. Are you looking for a place to host your latest creation? Want terrific support, lots of options, and easy setup? What if you could try it for free? Try out Rackspace at RubyRogues.com/Rackspace and get a $300 credit over six months. That’s $50 per month at RubyRogues.com/Rackspace.]
[This podcast is sponsored by New Relic. To track and optimize your application performance, go to RubyRogues.com/NewRelic.]
[Does your application need to send emails? Did you know that 20% of all email doesn’t even get delivered to the inbox? SendGrid can help you get your message delivered every time. Go to RubyRogues.com/SendGrid, sign up for free and tell them thanks.]
CHUCK: Hey everybody and welcome to episode 152 of the Ruby Rogues Podcast. This week on our panel, we have James Edward Gray.
JAMES: Good morning, everyone.
CHUCK: Avdi Grimm.
AVDI: Hello from Pennsylvania.
CHUCK: Josh Susser.
JOSH: Hey, good morning from San Francisco.
CHUCK: I’m Charles Max Wood and I’m half-asleep. And we have a special guest this week, and that is Jacqui Maher. Did I say that right?
JACQUI: You did, yes.
CHUCK: Do you want to introduce yourself?
JACQUI: Sure, yeah. I’m Jacqui and I work at The New York Times in the R&D Labs. So, hello from New York.
JAMES: Very cool.
JOSH: Well, hello New York. [Chuckles]
JAMES: Hello New York [chuckles]. So Jacqui, you have been telling us about all the cool things The New York Times uses Ruby for. So, why don’t you just tell us how that got started?
JOSH: Actually, I have a first question to ask which is, how did you end up at The New York Times?
JAMES: Yeah, that’s a good question.
JACQUI: So, I ended up at The New York Times, I guess I started in 2009. But my interview process started a year before that. My friend Jake Harris, who some of you might know, is someone that I know from the Ruby scene in New York. And he mentioned that there might be an opening on the team at the Times that he was working at, which was the Interactive News Desk. So, I went in for an interview in, that must have been in October or so of 2008. And well, the interview went well I thought. And then I got positive feedback. And it turned out that the Times was having a hiring freeze and they couldn’t hire anyone, unfortunately. I don’t know if you remember what that era, 2008, was like for the news industry. But there were a lot of doomsayers, right? That message has thankfully changed and things seem to be looking a lot better for media. But anyway, at that time they weren’t hiring anyone. So, about a year later I got a phone call from the person who would become my boss, Aron Pilhofer, and he asked me if I was still interested. So, I definitely was interested. So yeah, that’s how I ended up there.
JOSH: And I guess going to work for The New York Times, there’s so much history around that name. We had Sarah Allen on recently and she was talking about working at the Smithsonian. It seems like going to work there must be that same sort of thing. It’s a name that everybody knows and you just… it must be so revealing inside there.
JACQUI: It is. I was incredibly intimidated my first day. No, that’s not even fair to say. I was incredibly intimidated my first year working at the Times. My first day was basically walking into the news room of The New York Times being introduced to people whose bylines I recognized, people who have been working at the Times for 10, 20, sometimes even 30 years. And just sitting down in the middle of all that and being told, “Now go do your thing,” [laughs]. So, it’s… any new job you want to do well and you want to maybe impress people. Or you want to contribute in a positive way to wherever you’re working. And for me, working at the Times takes that to a whole new level, because I grew up reading The New York Times. I just have a lot of respect for the journalism and the innovation that this company does.
JOSH: Cool. So, innovation. Ruby. Wow. So, you said you were covering the Olympics. And that’s the sort of thing that a reporter says: “I’m covering the Olympics.” So, do you have that same sort of reporter attitude, like your job is providing news to people?
JACQUI: Absolutely. And I think this has been a topic of discussion for the last several years, and it still is, the topic of what is a data journalist, or what is the right term for someone who does what I do and what various other news application teams at other organizations do? Is it a hacker journalist? Is it a news developer? And are they reporters or not? But the way I think of it is whether you’re producing words for a story or data to go along with the story or the data is the story, you’re reporting. You’re covering.
JOSH: So, I guess a name that’s familiar to people around that would be Nate Silver, the FiveThirtyEight guy, who turned data into news in a huge way.
JACQUI: Yeah. And he used to work at the Times. He actually sat a few desks away from me.
JOSH: Oh, right.
JACQUI: Yeah, yeah. So, now he has a new project of course, outside of the Times. But yeah, he became very well-known for, well for me it was his baseball analytics, but of course on the political spectrum, calling elections and polls.
JAMES: So, let’s dig a little deeper into this Olympics example, if you don’t mind. You sent us a great article explaining it. Can you just talk a little bit about what the system took in and how that worked? Because it’s really interesting.
JACQUI: Sure. It’s going to involve XML, [laughs] if that’s alright.
JAMES: Woohoo. There goes our family-friendly rating. [Chuckles]
JOSH: Yeah, can we insert Mike Dalessio’s quote?
CHUCK: What’s that?
JACQUI: Was it Mike or Aron, the “XML is like violence”?
JOSH: Yeah, if it’s not working, you’re not using enough of it.
CHUCK: Didn’t Dave Thomas say that it was a DSL for Java?
JACQUI: Oh, that would explain a lot.
CHUCK: Anyway, you were saying.
JACQUI: Sure. So, when I first came onto the Interactive News Desk, it was towards the end of 2009. And a couple of people on the team were talking about Vancouver. And they were talking about this XML stuff that they were going to have to wrangle. And I was brand new and I didn’t really know what they were talking about. And suddenly, I was fulltime on that project. And that was my introduction to the Olympics and the Olympic data and a whole world of jargon and acronyms that are, yeah, it’s kind of amazing the whole spectrum of things that the Olympics entails. So, the way the Olympics coverage at the Times works is that we get a feed from the IOC, the International Olympic Committee. And what that feed entails is first, a ton of XML. But it’s everything from who the athletes are, participating in the games, what events are happening in the Games, when they’re happening, all of the results. And the results are not just so and so got first place. The results are incredibly detailed. And they send them out in, I guess at some point, the format of this XML makes sense for the systems that power the scoreboards, let’s say, at the venues. But when you’re sitting there as a developer trying to make sense of the stuff that you’re seeing, it’s very arcane. So, a lot of the, I’d say we spent a good maybe one-third of the entire development cycle, which lasted over a year, just trying to understand the data and how it maps to the events that we will watch on TV, or if you’re lucky to go to the Olympics, that you see in person. So, what in this very long XML message that is talking about extended results and extensions and logical dates, I think the logical date concept in that data feed was my favorite [chuckles], how does that actually map to say Usain Bolt getting first place in the 100-meter dash? So, that was a big chunk of our understanding. And then of course, we had to parse the XML. And to do that, we used Nokogiri which was great. 
But the speed at which you get the messages and the order in which you get the messages is incredibly important, because sometimes the messages will undo previous ones. Sometimes they add on to it. So, we had to build a lot of logic and set up a pretty extensive Resque set of queues to handle that.
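The parsing step Jacqui describes can be sketched in a few lines. This is a toy, not the IOC schema: the element and attribute names are invented for illustration, and stdlib REXML stands in for Nokogiri (whose API is similar) only to keep the sketch dependency-free:

```ruby
require "rexml/document"

# Parse one simplified, hypothetical result message and pull out the rankings.
# The real feed is far more arcane than this.
def parse_result(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements("//Competitor").map do |c|
    { name: c.attributes["name"], rank: Integer(c.attributes["rank"]) }
  end
end

message = <<~XML
  <Result event="100m-dash">
    <Competitor name="Usain Bolt" rank="1"/>
    <Competitor name="Yohan Blake" rank="2"/>
  </Result>
XML

winner = parse_result(message).min_by { |r| r[:rank] }
puts winner[:name]  # => Usain Bolt
```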
AVDI: Those messages, they were being pushed to you, weren’t they?
JACQUI: Yes. Yeah, they were being pushed to us.
AVDI: So, you had to keep up?
JACQUI: We did, yeah. There are ways of falling back and getting them to send you the message again. But if you’re trying to do anything close to a real-time coverage, you don’t want to have to fall into that. And the thing about missing a message is that you might not necessarily know that you missed it. So, how would you even know to say, send it back.
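For what it’s worth, the usual way to notice a missed message is a monotonically increasing sequence number on the feed. That is an assumption here, and part of the problem Jacqui describes is that you may have nothing like it:

```ruby
# Track sequence numbers and report any gaps, so a "send it again" request can
# name the exact messages that went missing. Purely illustrative.
class GapDetector
  attr_reader :missing

  def initialize
    @last = nil
    @missing = []
  end

  def receive(seq)
    # Anything between the last seen number and this one was skipped.
    @missing.concat(((@last + 1)...seq).to_a) if @last && seq > @last + 1
    @last = seq if @last.nil? || seq > @last
  end
end

detector = GapDetector.new
[1, 2, 5, 6].each { |n| detector.receive(n) }
puts detector.missing.inspect  # => [3, 4]
```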
JOSH: So, the whole description that you had of the ingest process for the XML and how crazy optimized very particular pieces of it had to be, that was fascinating. But then I love that it was this really interesting combination of very simple, straightforward tricks and some very sophisticated things. And I thought of the image of a carpenter using some sort of $300 laser sonar stud finder and then a hammer.
JAMES: That’s a great point. Maybe we can talk about that just a little. In the article, it mentioned how, and Jacqui mentioned again, that they’re coming in at this big way. An individual message could be up to 20 megs of XML. There’s not even really time to, in the thick of it, run it through the XML parser, because you need to be grabbing the next message. And so instead, there’s a very simple Rack app that hit the first line with a regular expression to figure out what kind of message it was, and then just stuck it in a queue basically, in a file, and then queued it for parsing. And so, it just, you flow them in as quick as it could and then decisions were made later based on how important something was and how behind they were in the queue. So, should you skip messages right now because this one’s not vital and we’re under heavy load and we’re behind? So yeah, Josh is right. It was a neat combination of, let’s use these simple tools, get this info in here, and then we’ll try to go to more robust systems to make sense of it and do what we can do with it. It was cool.
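A minimal sketch of the ingest pattern James describes, with all names invented (the real system queued work through Resque rather than an in-memory array):

```ruby
require "stringio"
require "tmpdir"
require "digest"

# A bare-bones Rack-style ingest app: sniff the message type from the first
# tag with a regex, dump the raw body to disk, and record it for a later,
# slower parsing pass that can prioritize or skip messages by type.
PARSE_QUEUE = []

INGEST = lambda do |env|
  body = env["rack.input"].read
  type = body[/\A\s*<(\w+)/, 1] || "unknown"                  # <Result ...> => "Result"
  path = File.join(Dir.tmpdir, "#{Digest::SHA1.hexdigest(body)}.xml")
  File.write(path, body)                                       # keep the raw message
  PARSE_QUEUE << { type: type, path: path }                    # parse later, by priority
  [200, { "Content-Type" => "text/plain" }, ["queued #{type}"]]
end

# A Rack app is just a callable taking an env hash, so we can drive it directly.
status, _headers, body = INGEST.call("rack.input" => StringIO.new("<Result event='x100'/>"))
puts body.first  # => queued Result
```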
JACQUI: Yeah, absolutely. We ended up just adding a page in our internal admin for the Olympics that had two buttons, really two buttons for every live event that was happening, saying please turn off the real-time data. If our systems were falling too far behind, that gave us the ability to just get the full messages instead of the incremental results. Yeah.
JAMES: We were kind of talking about this before the show, but we haven’t really brought it up on the show. There’s a lot of pressure from the readers and stuff, the people following these events, to get the info the instant it happens, right?
JACQUI: Yeah, absolutely. And not only to get it the instant it happens, but to get it accurately. So, doing something fast and doing something well and accurate and well, test it and everything with that verified, are two different things. And to have to do both of those at once is very challenging. So, we try to be as careful as possible. We went through many test events, both on our own and as orchestrated by the IOC. So, I’d say in the year or so leading up to any Olympics, the IOC is conducting test events. And those test events, I thought this was really weird and fascinating, but the test events can be real events that would have been happening anyway. So, every sport has its own World Cup.
JACQUI: Even my favorite, the Modern Pentathlon, has its own World Cup. And so, what the IOC will do is they’ll get the people that handle the data, which is Swiss Timing, a.k.a. Swatch, in Switzerland, they’ll get them to show up at these events with all of their equipment and produce the data feed for the Olympics as if it was the Olympics. And sometimes, the test events are just completely made up and they’re just running scripts on their servers.
JAMES: That’s interesting. So, it gives you a dry run before your system has to hold up for the real thing.
JACQUI: Exactly, yeah. The scariest part of building that project though, for me, was that we never got an end-to-end happy path of the data until the Olympics actually happened. So, we crossed our fingers, to be honest, and hoped that the documentation that we were provided would actually be accurate and that we would get a full suite of messages for a particular event. But everything from the opening ceremony to the closing and all of those events happening in different combinations at the same time, we were never provided with that. So, we just had to make a lot of educated guesses.
JAMES: Wow.
CHUCK: So, you just had people stay up late?
JAMES: Yeah, [chuckles] everybody on call, pretty much?
JACQUI: Yeah. We divided into shifts. My colleague Ben Koski and I were actually sent to London for the games. And so, it was a five-hour time difference between London and New York. So, we had a team in New York that would take over towards the afternoon, London time. But I was having to meet up with the Times team at King’s Cross station in London at seven o’clock in the morning, which is not a time of day I’m usually awake for.
JACQUI: And then having to get on their special high-speed train, which was called the Javelin because we have to torture metaphors [laughs]. And go through security at the Olympic park and then take a bus to the media office, and then finally find myself in front of a computer, hopefully before the first event started, which were usually at 8:30 in the morning.
JAMES: So, I have a question about this coverage. We’re talking primarily about the statistics of the Olympics, obviously – who’s running and what speed, and those kinds of things. How does that relate to the more traditional journalism side? And how do you marry those two? I’m assuming The New York Times also wrote pieces about the Olympics. And was that in any way related to the data? Or were those handled as two totally separate systems, or what?
JACQUI: I guess the answer is a little bit of all of the things you just said. There were of course, lots of stories written by reporters without using data for the Olympics. But there were also many cases of collaboration, I guess, between what we would more traditionally call reporters who are just writing stories, and myself and the team producing the results and the website package for the site. One thing I found [chuckles] fantastic personally but also kind of funny, is that I already mentioned my favorite sport being the modern pentathlon. And I don’t know if you guys are familiar with this event or this whole sport.
JAMES: I’m not.
JACQUI: The modern pentathlon is really not that modern. [Chuckles] It’s this event that was created at the end of the 1800s that was supposed to be the ultimate test of an athlete, but the ultimate test of an athlete, or a soldier as of 1880. So, think Teddy Roosevelt and the rough riders, or things like that.
JACQUI: So, the first thing you have to do is you have to fence everyone else. So, what happens in the games is that all the athletes are paired up on different mats. And they all fence at the same time. And one touch, you’re out. Then you have to go and swim. And after that, you pull yourself out of the water and you have to ride an unfamiliar horse, which is the part that really caught my eye, the [documentation].
AVDI: That is awesome.
JACQUI: Yeah. And then after that, you run and shoot, though sadly not at the exact same time.
JACQUI: So anyway, the modern pentathlon was the first test event that was actually happening at Greenwich Park that our systems were developed enough before the games to take part in. And so, I was sitting there worried that we were going to just find a lot of bugs, but also just really curious. How does this work? We had it on TV. I did not really know what the modern pentathlon was at that point. And I’m looking at the data coming in and I start noticing some words that I’m not… well first, I start noticing words in English just coming through the data, which was unusual. And it turned out that what they do is they send you these biographies of the horses, of the unfamiliar horses
JAMES: Ha! Wow.
JACQUI: And [chuckles] they sound like personal ads. They’re a bit risqué.
JACQUI: They’re really weird. And so, I brought all of this up because you were asking about the marriage of data and our reporting. So, I’m pretty sure, maybe you can fact check me on this, but I looked and we didn’t do much coverage, or maybe no coverage of the modern pentathlon for the previous summer games. But we found this data so fascinating. And the Interactive team has a really good relationship with the sports desk at the Times. And so, I would talk to reporters about this stuff. And I started showing them these horse biographies. And they ended up writing I think about five different stories on the modern pentathlon in 2012. And I got my first credit for contributing to reporting. And I eventually was even sent to cover the last event of the games myself in Greenwich Park, which was the women’s modern pentathlon. So, there are all different ways that we will collaborate from the more tech side to the traditional reporting side. Sometimes that also involves just inserting some of our, whether it’s an interactive or a statistical sort of listing of results, not just from the Olympics but also from presidential elections and midterm elections, into our stories just to enhance the coverage.
JAMES: You mentioned fact checking right there. Is that some of the ways collaboration happens? Is there any merit to using these data streams to do some fact checking of traditional reporting or things like that?
JACQUI: Yeah, absolutely. So, in the case of the Olympics, again during the games while we’re getting the data in, sometimes before what the announcer at the venue or the Olympic broadcast channel, which is this internal news network that they set up at the games, before they would report a world record perhaps, I would see it in the data as it comes in. So, that’s one way. Or just to reference times or the order of rankings and things like that in a story. Yeah, you could absolutely use the data for that.
JAMES: That’s cool.
JOSH: So, do we have more to talk about the Olympics?
JAMES: No, let’s move on. Let’s talk about crazy R&D.
JOSH: Yeah. So, you’ve been working in Go, right?
JACQUI: I have, yeah. I have been learning and [letting] Go.
JAMES: That’s it. This call’s over.
CHUCK: The Go [gobes].
JOSH: Yeah, how’s that going?
JACQUI: [Chuckles] Yeah, it’s a funny name for a language. It’s hard to google anything about Go. Go [inaudible].
JAMES: Yeah. You would think that they…
JAMES: Ironically, yeah. [Chuckles]
JACQUI: [Laughs] Right, right, right.
JOSH: So, you built the whole stream processing thing for data. And I found the whole write-up of that really interesting. And then I found the whole thing of this, “Hey there’s this OpenNews website. What am I looking at here?” Can you just tell us about this OpenNews website?
JAMES: I was wondering about that, too.
JOSH: Yeah, what is that?
JACQUI: Sure, yeah. And I thought that would be a cool thing for listeners of this podcast to know about, that not only are there developers going into the whole journalism world. There are also now these sources of information online that you can read about some of the more interesting projects that those developers are doing. So, the OpenNews project is a joint collaboration between the Knight Foundation and Mozilla. And they have a few different things online that shed some light on the industry, including Source, which is where a bunch of my links was hosted. So, Source really just aims to provide a list of what are the current events related to news and technology, what are people doing? They get people from the industry to write up projects like, I’ve written up a few of the projects that I’ve worked on at the Times on there. And then they also have a whole learning section that’s meant to help people whether they’re coming from the technology angle and wanting to get into journalism, or vice versa. How exactly do you do that? Because it’s a pretty new field and schools are only now starting to try to address it. So, that’s OpenNews. And they also sponsor fellows every year. So, they’ll take in applications and then the people that they pick get put in different news rooms around the world, in South America and Europe and the US.
JAMES: That’s interesting.
JOSH: But that’s cool, that there’s that whole community and industry segment. That sounds pretty worth checking out. Okay, so yeah. So, you gave us a couple of links to things that you put up on the Source on OpenNews. And one of them was this stream processing stuff that you did.
JACQUI: Right. So, that project Streamtools, it’s an open source project and it’s actually how I got into R&D in the first place. Towards the end of last year, I switched my focus from producing interactive content or doing things like election coverage or Olympics to a more internal-facing one, which is news room analytics. And that’s a whole other can of worms. We can talk about that for a long time, too. So, I’ll just say that it’s a way for the news room to try to get to understand its audience more, because we don’t produce just a single paper that gets sent out to everyone. We can produce different experiences for different people. So, how do we do that? Well, one initial step we can take is trying to understand who is coming to The New York Times, how they’re coming to The New York Times, whether it’s on the website or a phone or a tablet. And so, when I started to do that, I of course needed to start seeing what data we had available on activity on the site and the apps. And what I found was another group inside the Times had written their own suite of tools to consume that sort of activity. But then unfortunately when it came to trying to query it, there was nothing. The data was gzipped on S3, and I eventually figured out that the files were named by timestamp and some kind of instance ID from EC2. So, there was no way to understand what those files contained. So, I started going down the road of just writing scripts to automate downloading and unzipping and all of that stuff, but that all took a long time. And someone told me that R&D had been working on this tool that was all about consuming data. So, I started hanging out in R&D. And when I saw Streamtools, I tried it out on this particular stream of data. And it was great and I just found the whole thing fascinating. I’ve been doing Ruby for, oh my gosh, I think I started doing Ruby at the end of 2006.
And I really had been wanting to learn another language just to get some variety. And I had been poking around with Clojure but I didn’t really have a use case for Clojure. And then along came this data problem and Streamtools, which is all written in Go. So, the idea of Streamtools was really my colleague’s, Mike Dewar. And he and Nik Hanselmann in R&D had been working on it for I guess a couple of months when I joined. So, that’s how I ended up working in R&D in the first place. But it’s something that you can interact with just on the command line with curl. It’s all over HTTP. But it’s also a visual programming language in your browser. So, how it works is, basically, it’s based around the concept of blocks and connecting those blocks. So, a block can be an input block, like a block that will read data off of a queue like SQS. Or it will accept an HTTP POST containing data. And then there are other blocks that can be used to manipulate the data, whether to filter it or to rename some of the fields in there. And then there are the output blocks. And so, we have a fairly good coverage I think of the different kinds of inputs and outputs that you might want. But there are still plenty more that could be written. We have some blocks that will output into Elasticsearch, RabbitMQ, readers and writers. It’s an open source project, and the hope is that it will allow people to work with real-time streams of data in an easier way. And if there isn’t a block for the particular scenario that you’re dealing with, you can write one fairly easily. So, for me that’s involved learning a new programming language and then writing it in a way that might make sense to other people who don’t even know me and building a framework that would make it easy for people to contribute to it.
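The block-and-connection model Jacqui describes can be sketched in a few lines of Ruby. Streamtools itself is written in Go and wired up over HTTP, so this is only the concept, with invented block behaviors:

```ruby
# A toy dataflow: each block applies a function to a message and hands the
# result downstream. Returning nil drops the message, which is how a filter
# block works; a rename block rewrites fields; a sink collects the output.
Block = Struct.new(:fn, :downstream) do
  def emit(msg)
    out = fn.call(msg)
    downstream ? downstream.emit(out) : out if out
  end
end

collected = []
sink   = Block.new(->(m) { collected << m }, nil)
rename = Block.new(->(m) { { user: m[:name] } }, sink)    # rename a field
filter = Block.new(->(m) { m if m[:name] }, rename)       # keep only named msgs

filter.emit(name: "jacqui", time: 1234)
filter.emit(time: 5678)                                   # dropped by the filter
puts collected.inspect
```

In Streamtools the same wiring happens visually in the browser, or over HTTP with curl, rather than in code.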
AVDI: When I was looking at the blog post on that framework and particularly how you program, you hook the blocks together visually, it reminded me a lot of some work I did in LabVIEW. Really interesting model where you just define inputs and then that drives the rest of the system.
JACQUI: Yeah, absolutely. Oh, you know, I should send you guys a link to this great blog that we found that’s a roundup of all the visual programming languages going back to 1963, I think the first one was. They actually just added Streamtools to it, but all of those previous projects were a big source of inspiration. And we would talk a lot about what worked and what didn’t work in the tools that we had used before.
JOSH: So, this is basically a dataflow system, which I think is awesome. And those things are pretty cool because okay, it’s a nice way for describing essentially a big data problem. Is this stuff targetable to something like Hadoop? Can you take the Streamtools stuff and map that onto a distributed network?
JACQUI: Yeah. I suppose you could. Right now we’re focusing first on having it be useful for the news room. The news room is already starting to work with it, actually. So, this is a case where R&D is looking forward but also working in the present, too. So, that’s kind of cool. We’ve gotten a few contributions from people on the Interactive News Desk in the form of pull requests on GitHub. But they’re also just now starting to use it to replace some of the more programming-intensive parts of projects that could just be done with Streamtools. I’ve actually run the Olympics data through it. [Chuckles]
JOSH: That’s good. So, does the Olympics data actually qualify as big data in your opinion? Or is it still just medium data?
JACQUI: Yeah, I mean I don’t know what the [chuckles] what the limits are, or whatever for.
JAMES: The magic cutoff is?
JACQUI: Yeah, the magic cutoff. I wouldn’t have said it’s big data, whatever big data is [chuckles], because it’s a lot of data in esoteric format. And by esoteric I don’t mean that XML by itself is that esoteric. It’s more that it’s just hard to decipher.
JACQUI: But big data is more than several terabytes, even. I think that’s what most people mean.
JOSH: Right.
JAMES: It seems like something like Streamtools, correct me if I’m wrong, but it seems like what you’re doing is lowering the barrier to entry for playing with different kinds of data. You have these pre-pluggable pieces that handle a lot of common inputs and stuff. And then all these ways you can mix and match them. Maybe you have to define one block, as you said, that’s a particular bit for whatever you’re playing with now. But it seems to lower the barrier of entry to hooking up different data streams and seeing, is this something we can use and do productive things with? Is that one of the goals?
JACQUI: Yeah, absolutely. Lowering the barrier both in terms of who can actually work with those sorts of data streams, so you don’t necessarily have to be a programmer to use Streamtools. You can use it in your browser, connecting blocks as they currently exist, because there are some little configurations that you can do in the browser, like pointing it at the URL of your queue on Amazon, for instance. But the other barrier that we were hoping to lower is for the developer. When you’re working with, whether it’s a new stream of data or an existing stream of data that you want to explore differently, just being able to make changes to how you’re processing it live without having to even restart anything can just make the whole development process so much faster, easier.
JAMES: That’s a good point. I saw in the write-up they had a neat example where it’s obvious that a tool like Streamtools is ideal for things that are pushing information to you. So, Twitter streaming API or something that delivers posts like the [Olympics state] or whatever. But in the example, it was something that you just had to poll every so many seconds. And Streamtools handled that just fine by having this ticker that emitted a time every so often. And then you hook that up to the thing you want to do so that every time the time comes in, you could reach out to a website and grab the new data or whatever. So, it was interesting how it even let you do things like that, I thought.
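The ticker trick James mentions is easy to mimic outside Streamtools too. A sketch, with the fetcher stubbed out (a real block would issue an HTTP GET on each tick):

```ruby
# Emit a "tick" on an interval, a fixed number of times, and let the
# subscriber decide what to do on each tick -- here, poll a fake data source,
# turning a poll-only API into something that behaves like a stream.
def ticker(count, interval)
  count.times do
    yield Time.now
    sleep interval
  end
end

polled = []
fake_fetch = -> { { "score" => rand(100) } }   # stands in for fetching a URL

ticker(3, 0.01) { |_tick| polled << fake_fetch.call }
puts polled.length  # => 3
```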
JACQUI: Yeah. So, there are ways that we come up with to make non-streaming data act like a stream. But the forward-looking part of this project is that we think there will be more streams of data coming over the next couple of years. For now, for dealing with data that isn’t necessarily coming to you like the Twitter Firehose, to use your example, yeah, there are ways you can make it behave like a stream.
JAMES: That’s cool.
JOSH: So, one of the things you said in the write-up about Streamtools was you talked about how it gives you a conceptually different outlook on how you analyze data. And that it’s much more about the instantaneous velocity, what’s happening right now, as opposed to piling up all the data for the last year and extracting something from that mass of data.
JACQUI: Absolutely. I think you’re referring to the write-up that my colleague and I did on Streamtools.
JOSH: Yes, right. Yeah.
JACQUI: Yeah, no absolutely. It’s a shift in the way that I’ve thought about dealing with any sort of stream or feed of data. I would usually have to download a data dump or wait for an amount of data to accumulate before I could start doing things like checking for patterns or processing it or whatever. But this way, you can actually just start inspecting it in real-time.
JOSH: And does that give you a di-… do you find yourself approaching the software side of that differently when you’re building things for that sort of outcome?
JACQUI: What do you mean with software?
JOSH: That’s maybe a pretty vague way of putting it. I’m thinking that when we work with data in Ruby, we’re doing our ActiveRecord queries to look in the database to go do a big fetch and find all this stuff out. The streaming stuff, it’s like, well there’s stuff that we need to know right away. And it was just, I’m not doing a good job of articulating this. It just seems like when you’re approaching an analytics problem, at some level that’s what we’re talking about, you’re analyzing a lot of data, that it’s what’s happening right now versus what happened last month. And I don’t know. I’m not able to complete this thought. [Chuckles] So, maybe this is [inaudible] at all.
JACQUI: No, I think you’re right. There is something there in that you don’t have to wait for whatever piece of software to process this. You’re getting data in real-time, let’s say. And typically you’d have to receive it, parse it, manipulate it in some way to put it into some kind of data store, whether it’s a MySQL database or whatever. And then you could start analyzing it. This changes that whole pattern. And yeah, does that sound about like what you were talking about?
JOSH: Yeah. That’s basically what I was asking.
JOSH: Are there just certain libraries that make that easier or different? I see that you wrote this stuff in Go, so obviously Go is built for a certain kind of software. So, that seems like maybe a good fit.
JACQUI: Yeah. Go makes it easier to have network concurrency and to do things like make different channels and broadcast data across different channels and receive data on other channels at the same time. So, in the end, moving from working with Ruby to a compiled language like Go, I’ve had to really step back and rethink the way I approach programming a lot. In Ruby, I think we get a little spoiled with just how dynamic it is and how you can extend and mock out things so much more easily. But Go, on the other hand, is very fast. And it’s very good at things like concurrency and dealing with data and streams and things.
AVDI: I’m actually a little curious about the architecture there, if you don’t mind talking about that. With the blocks in Streamtools, do they map to a single Go routine or a family of Go routines? Or does it not break down that simply?
JACQUI: No, it does break down like that. So, each block is part of what we refer to as our library of blocks. And over all of them, there’s a block manager that will manage the requests coming in and going out of the blocks, going from one block to another, or going from the block at the end to the final output, wherever that is. But each block is a different file, at its most basic level, in the library of Streamtools. And sometimes the blocks will actually spawn their own Go routines. And sometimes they just follow a very simple pattern.
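The block-and-manager arrangement Jacqui describes can be sketched in Ruby as a toy model (the class names and the two example blocks here are hypothetical illustrations, not Streamtools’ actual Go implementation):

```ruby
require "json"

# A toy version of the Streamtools idea: each block transforms a message,
# and a manager routes the output of one block into the next.
class Block
  def initialize(&transform)
    @transform = transform
  end

  def call(message)
    @transform.call(message)
  end
end

class BlockManager
  def initialize
    @chain = []
  end

  def add(block)
    @chain << block
    self
  end

  # Push a message through every block in order and return the final output.
  def emit(message)
    @chain.reduce(message) { |msg, block| block.call(msg) }
  end
end

# Example chain: parse a JSON payload, then keep only one field.
manager = BlockManager.new
manager.add(Block.new { |raw| JSON.parse(raw) })
       .add(Block.new { |data| data["magnitude"] })

puts manager.emit('{"magnitude": 4.5}') # prints 4.5
```

In the real system each block lives in its own file and may spawn its own goroutines; this linear chain only shows the routing idea.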
JAMES: That’s interesting.
AVDI: Okay. So, the UI is in the browser.
JACQUI: The UI is all in the browser, yeah.
AVDI: I hadn’t paid enough attention to that part.
JAMES: I’ve noticed just at my current job, we have a lot of data flowing through the system. Just getting some kind of visibility on that data flowing through can change the way I think about it. Once I can see the shape of it and what’s happening in real-time, then oh, I can see this is the pattern or whatever. And I imagine tools like Streamtools make that easier to do.
JACQUI: Yeah. Part of the idea of it is that you could use it as, I guess like a prototyping tool or that will help inform your decisions on how you’re going to store the data or how you’re going to interact with it in the end. And then on the other hand, maybe you pipe data through something like Streamtools and you realize, oh I don’t actually have to store this at all, because you get to have that sort of overview and understanding of it in real-time.
JAMES: Alright. That’s cool. So, taking the conversation in a different direction, how does The New York Times, as they get more into this development stuff, how do they get – you mentioned data reporters, this kind of programmer/statistician/reporter concept – how does The New York Times get rolling with something like that? Do they just hire a few developers? What’s the process there?
JACQUI: How did we get started with that?
JAMES: Yeah, yeah.
JACQUI: Aron Pilhofer who, before I joined R&D, was my boss on the Interactive News Desk, started out working on the desk that has possibly my favorite acronym in all of the world, CAR, which is Computer Assisted Reporting, a term that [chuckles] dates back to at least the early 80s. I’m not sure how far back it goes. But the CAR desk, as it’s typically called, was one of the first steps towards merging journalism and technology. So, those people are traditionally the ones that will do analysis of very large data dumps, like maybe from the census, or any time there’s a Freedom of Information Act request for, I don’t know, a certain state’s healthcare records for their hospitals for the developmentally disabled. That’s another example. When those sorts of data sets get delivered, it’s usually the CAR desk that will handle them. So anyway, my former boss Aron was working there and he saw the way that texting and the internet and everything like that were changing things, and how businesses were all online. And of course the Times had a website, but our reporting was pretty separate from doing anything on the web and that kind of thing. So, he started the Interactive News Desk back then in, I guess, 2006, 2007. And how do you start hiring people for a job that hadn’t really existed before? You get journalists who are either a little adept with technology or interested in learning technology, bring them on, and help them get up to speed on something like web development with Ruby. Or you get people who are already experienced programmers and developers and get them up to speed on what it means to be a reporter and what the different concerns and priorities are for doing that kind of software development in the newsroom compared to at a startup, let’s say. It’s not always possible or even a priority to have a very high amount of test coverage and an automated test suite for something that has to go out in an hour.
So, it’s a different, I guess it’s a different sensibility. So, we tend to hire people who are in one world a little more than the other. And they learn on the job.
JAMES: It’s really fascinating. So, you’re saying that a lot of the development you do is chalked up more to simple scripts to massage a certain data set or something and not so much the long-term process that you would want to have a robust test coverage on and stuff like that?
JACQUI: Yeah. And very often though, the thing that you hacked together to meet a deadline is successful and people want more of it. And then you’re having to make some code generic and reusable. And at that point it’s probably a good idea to start putting in some test coverage and things like that. But figuring out which of those things you hacked together in an hour is going to take off and get a lot of attention within the company, that’s pretty hard to do. So, you try to write good code because again, you don’t want to get something on the front page of The New York Times and have it just be wrong or error out. So, it’s trying to find that balance between good software development practices and journalistic sensibilities, too. I guess what I would say is that one of the main differences that I’ve found is that the focus isn’t on the technology itself working here. The focus is on the results. So, while you can talk about how cool this queue system that you found is, or compare the merits of, say, Sidekiq versus Resque, in the end no one cares [chuckles] which one of those things you used, or even that you used Ruby or whatever to produce this thing, when they’re coming to read the news. No one’s going to care if your test suite runs 30 seconds faster than yesterday. It’s really just more about the results and less about how you got there.
AVDI: I’d like to expand on that a little if we could. I’m fascinated by the picture I’m getting of the environment that you work in, because it really sounds like both the Interactive News department where you have all these very quick turnaround one-off jobs to do that might just be a day of work, or with the R&D stuff where you can play around with ideas – I don’t think we talked about it but some of the other background reading I did, there were some interesting articles about constructing just little bots that do interesting things with data or with Twitter and stuff like that – it really sounds to me like you work in a very idea-rich environment. It seems like there are a lot of opportunities for coming up with an idea of something that you can quickly do with data and then doing something useful with it, doing something that enlightens people or amuses them, if nothing else. And honestly, I envy that. And that’s not something that everybody gets. Even a lot of people with regular programming jobs, a lot of times every day it’s the same idea. Do I have to go to work for The New York Times to work in an environment like that? Or are there ways that I could expand my horizon and, I don’t know, have that kind of interesting data around me where I could think, “Hey, I could put two and two together and make something cool in four hours”?
JACQUI: Well, if you’re interested in working for The New York Times I do believe we are hiring.
JACQUI: If you want to come and save journalism.
AVDI: Well of course, I ask on behalf of the listeners.
JACQUI: [Chuckles] Sure.
JAMES: Avdi may not even be a Rogue by the end of this episode.
JACQUI: No, I think you’re right there. It is an idea-rich environment here. And part of that comes from the group of people that the Times has gotten together, the very talented group of people. But it’s not just The New York Times that has talented people of course. The other end of it I’d say is out of necessity. The newspaper industry, how many of you have actually bought a print copy of the paper recently, of any paper?
JAMES: I have not.
JACQUI: So, we have to innovate.
JOSH: Bought? You mean like paid money?
JAMES: [Chuckles] Yeah.
AVDI: We have.
CHUCK: They still make those?
AVDI: We have, but I hesitate to say for what purpose.
JACQUI: Right, for things other than packing or moving boxes and stuff, right? So, the industry has to change and has to innovate and has to come up with new ideas. So, it’s a necessity, too. But to your question about do I have to work at The New York Times to do that, no, of course not. And not everyone can even work or even wants to work at a media organization. So, there are just so many available data sets out there that, as far as I know, no one’s really exploring or doing anything with. Or someone could be and I just haven’t heard about it, and that’s entirely possible, too. But if you just… I’ve been going on this website, ProgrammableWeb, which has a pretty good listing of available APIs and they’re updating it all the time, just to see what would happen if I take data from the USGS, the Geological Survey, on earthquakes. If I can consume that, and maybe here’s an idea, mash that up with analyzing what news stories are getting produced in relation to events that are happening, I guess you could term them disaster events or whatever, and see how that impacts what people are even talking about on Twitter. Twitter has an API, right? So, I think there’s just a lot of possibility out there for either doing analysis of single sets of data or mashing up different sets of data as well.
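That USGS idea is easy to prototype in Ruby. The earthquake feeds come back as GeoJSON; the sketch below parses a trimmed-down inline sample with the same shape rather than hitting the live feed, and the feed URL in the comment, the sample values, and the magnitude cutoff are all assumptions for illustration:

```ruby
require "json"

# The live past-day feed would be fetched from (assumed URL):
#   https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson
# Here we parse a small hand-made sample with the same GeoJSON shape.
sample = <<~GEOJSON
  {
    "features": [
      {"properties": {"mag": 4.7, "place": "Off the coast of Oregon"}},
      {"properties": {"mag": 2.1, "place": "Central California"}},
      {"properties": {"mag": 5.3, "place": "Fiji region"}}
    ]
  }
GEOJSON

quakes = JSON.parse(sample)["features"].map { |f| f["properties"] }

# Keep only quakes big enough to be newsworthy, say magnitude 4.5 and up.
notable = quakes.select { |q| q["mag"] >= 4.5 }
notable.each { |q| puts "M#{q['mag']} (#{q['place']})" }
```

From there, cross-referencing `notable` against a news API or Twitter search is just another fetch-and-join step.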
JOSH: So, it sounds like…
JAMES: I have to bring this up, because you sent us this awesome article. It was one of the ones I read before this episode and I didn’t really appreciate it until you just talked about that. But somebody had made a bot at one point that was The New York Times minus context.
JACQUI: [Chuckles] Yeah.
JAMES: It is a Twitter bot, @NYTMinusContext. And they had taken these funny sentences from inside New York Times articles but had removed all the context. So, you had no idea where these were coming from. And so, there were expressions like, “scrum of euro kissing and plastic surgery expressionism,” is one of the ones that’s mentioned in the article. And then people in The New York Times begin to find this feed. And they were like, “Wow, what article was this?” And so, they would hunt through trying to find them. And so then, you ended up making a bot, @NYTPlusContext which would take these quotes, go figure out which article they came from, and then tweet the link to the article to put the context back. [Chuckles]
JAMES: It was great.
JACQUI: Yeah. Originally, the idea for that came from a place that I think a lot of good ideas from programmers come from, which was laziness. I wanted some way of just finding the sources of all of those quotes. Because they’re really intentionally off-the-wall quotes that you wouldn’t necessarily think would show up in The New York Times, right? But then, it’s become this serendipitous way of reading the content that the Times produces. And I’ve gotten quite a lot of feedback from people on Twitter, mostly saying that this is actually just how they read The New York Times.
JAMES: [Chuckles] That’s awesome.
JACQUI: They just go between MinusContext and PlusContext. And that is how they read the paper now. So, that was a lot of fun, building that. I actually got some negative feedback, too, that it was ruining the serendipity or whatever by adding the context back, which is why it doesn’t just publicly tweet. It actually just replies to @NYTMinusContext. So, my hope was that if you didn’t want things spoiled, if you can call that spoiling it [chuckles], you could just not follow the bot, right?
AVDI: Well, and I note that the @NYTMinusContext account actually references PlusContext too, in their bio.
JACQUI: Yeah. We’ve exchanged a couple of emails, too. He’s been very gracious about that.
JOSH: Nice. So Jacqui, just backing up a few moments, it sounds like the tools, or the software tools, are coming together or at least becoming available for citizen journalists to become data journalists. It seems like that’s a lot of what you’re talking about here. And that there’s a big opportunity for people to step up and get their hands on one of these data sets or some of these data sets and start doing reporting on the information that is available there.
JACQUI: Yeah, sure. The government as well, at both the national level and at the state and even city levels. Like in New York, we have our own dedicated site for data sets that are available for the city, and of course there’s data.gov. There’s more and more opportunity, I’d say, for people to find out things that the government is tallying in various ways. Congress has long had different kinds of access to data, and I think the congressional data is getting better. And there are different APIs that you can use to keep tabs on what bills are coming through in the House and that kind of thing. So yeah, I think there is opportunity for non-official reporters, I guess, to inspect the data that’s available. Absolutely.
JAMES: What I find most interesting is, along the lines of what Avdi said, being envious of your process. And I think it’s that by being at this interesting junction of all of this data, you have these opportunities. And so, because of that, it’s almost forced you to a near-ideal programming cycle in some ways, in that you have to do this [spike], something on a deadline, that you throw out real quick and see how that goes. And then like you said, you have no idea which ones are going to catch on or whatever. But then once you already have that data, you can go back and be like, “Alright, let’s turn this into a real system and build it the way we would have to do it to make it robust and maintainable,” or whatever. And who cares if you have to totally redo this [spike] work? It’s an hour or two of work or whatever that you do to get it right. And knowing what you know now, it’s nice. Whereas in the other world, we sometimes make the mistake of pouring large amounts of development effort into something we don’t know is going to be useful and building it robust from the get-go. And you have the luxury of being able to test it first, with a minimal amount of output. I think that’s great.
JACQUI: Yeah, that’s a good point. But keep in mind that once you’re having to build it into a more robust system, as the saying goes, the news doesn’t stop. And there are new projects coming in as well that will take away from your ability to build out that robust platform.
JACQUI: But generally speaking, yeah, I think you’re absolutely right. That’s a very positive way, I think, to look at that process. Because as I think you guys all know, and certainly I would imagine most of the listeners of this podcast know, building quality software is difficult. And it’s something that you have to give consideration and time to. So yeah, just being able to test something, get a lot of users interacting with it in a short amount of time, and see what works and what doesn’t work, can definitely help inform the decisions you would make in building out a platform, absolutely.
JAMES: It’s been super cool to hear about the process at The New York Times and stuff. It’s very interesting. What cool times we live in.
JOSH: It’s very awesome.
CHUCK: Are we going to start making Times jokes now? We’re up with the Times.
JAMES: Oh, oh, nice.
JOSH: Sounds like it’s times for picks.
CHUCK: Alright Josh, what are your picks?
JOSH: [Chuckles] Okay. Thank you for throwing that right back in my face.
CHUCK: No problem. I practice every day.
JOSH: Okay. So, let’s see. Conferences. Last night I discovered that there is this amazing conference going on next month that I really want to go to. And it has nothing to do with programming. It’s called ‘Buffy to Batgirl’.
JOSH: [Chuckles] Yes. [Chuckles]
JOSH: And I really want to go to it. It’s at Rutgers University next month. And the program for this conference is just amazing. It’s like, “What’s In the Basket Little Girl?: Reading Buffy as Little Red Riding Hood”. Things like, “Warrior Women of the 1980s”, “Heroine as Huntress: Images of Female Archers in Comics, Fantasy and Science Fiction”. It’s just all feminism applied to the kind of pop culture that I love. So anyway, I doubt I’ll be able to go, but I really want to. So, I hope somebody else goes and tells me all about it. [Chuckles] That’s my dream pick this week. And then I have a very cute little pick, the Unix command line tool expand. So, I wrote a little script. Somebody gave me a project recently that had tabs in the source code.
JAMES: Boo. Ouch.
JOSH: Who puts tab characters in source code? What is this, the 1970s? So, I discovered there’s a Unix utility called expand.
AVDI: You realize there’s a Go programmer in the room, right?
JACQUI: I hate you now. [Chuckles]
JOSH: Hey, I’m a Smalltalk programmer. We use tabs in Smalltalk all the time, too. But anyway, expand will convert tabs to soft tabs, basically runs of spaces. And it’s a lot smarter than a regular expression that just turns tabs into spaces, because if you have partial tabs, if you start your tab in the middle of a run of spaces, it makes everything line up properly. So, the expand utility in Unix. Those are my picks.
JACQUI: That sounds really useful. I would hope that everyone has their text editor set up to not use tabs and use spaces instead. I do all my code editing in vim and expand tabs and stuff. But it’s cool. So, that’s all in the command line. You can just run that on the command line.
JOSH: Yeah, expand -t 2. That will convert your tabs into soft tabs.
JACQUI: Two spaces.
JOSH: Yeah, so I just wrote a little Ruby script that found all of my files and ran expand on them and I just checked all that into Git.
JACQUI: Oh, useful.
JOSH: Okay, I’m done.
CHUCK: Alright James, what are your picks?
JAMES: I’m still busy reading the man page from Josh’s command here.
JAMES: How did I not know this exists? It’s very cool. Okay, two picks. First off, it’s time for Rails Girls Summer of Code 2014. They’re doing a campaign right now to get funding to help as many students as they can go through this process and improve open source. I’m pretty sure we’ve talked about this in the past. We’ve definitely talked about the various kinds of Summers of Code. But I think this is important. It’s super cool. So, we should all help out. And I’m going to put a link in the show notes to that. Please go and help out, even if you’re just volunteering to be a coach or something. That’s helpful. If you can afford to give money, that would be even better. Just in whatever way you can help out, that would be great. The other thing that I want to mention, we had this funny exchange before the show over what R&D is. And I have this broken version of R&D in my head due to the show Better Off Ted. If you’ve never watched Better Off Ted, it’s on Netflix streaming, so it’s easy to pick up there. It’s about this company, Veridian Dynamics, that does all kinds of horrible things to their employees, and these two scientists who work for them and pull off all these fun projects, inventing scented light bulbs and edible moss for NASA and just all kinds of zany things. It’s kind of Arrested Development-esque humor, I would say. So, if that appeals to you at all, you would probably enjoy it. So yeah, Better Off Ted. That’s my other pick.
CHUCK: Awesome. Avdi, what are your picks?
AVDI: Let’s see. I’ll start with a tool that I found really useful recently. I was working on my publishing toolchain and I realized I wanted to be able to take a finished PDF, extract a few pages from it, and assemble those into a different PDF. And some folks recommended a tool called PDFtk, whose name first threw me because I was thinking of the Tk widget toolkit. But no, it turns out it’s a command line program that is great for manipulating PDF files. You can do a whole lot of stuff with it. But one of the things you can do is basically just tell it, okay, take pages 5, 17 through 23, and 50 through 59, extract those out, and put them together into a new file. And that’s just one command. So yeah, PDFtk, very useful if you have need for that sort of thing. There is a blog post that I really, really enjoyed recently. It’s called Letter to an aspiring developer, by Brandon Hays. And I’ve read a number of this style of blog post over the years and I’ve even written one or two myself. But this is really one of the best ones in this vein that I’ve ever read. He’s got some really fantastic advice in here. And finally, I bought a new backpack recently. And longtime listeners who are familiar with my picks will probably not be surprised that I bought it from Tom Bihn. And I am so pleased with it. It is the Synapse 19. They also have a Synapse 25, which is a bigger version of it. But it’s basically a daypack-sized backpack. And I’m not going to go on and on about what all makes a backpack well-engineered. But this one really is. Just one example: lots of backpacks have a place to put a water bottle. This one has a place that fully fits my 750mL water bottle. But most backpacks will do things like stick it on the side somewhere. This one puts it dead center at the very back. So, the full water bottle neither throws the balance of the backpack off to one side, nor does it dig into your back.
So, lots of little design decisions like that that make it really nice. And that’s it for me.
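The page-pull Avdi describes is pdftk’s `cat` operation. A small Ruby helper might assemble the one-line command like this (the range syntax matches pdftk’s documented `cat` form, but verify against your installed version; the file names are placeholders):

```ruby
# Build the pdftk invocation that extracts pages 5, 17-23, and 50-59
# from one PDF into a new one. pdftk takes the page ranges directly
# after the `cat` keyword, so the whole job is a single command.
def pdftk_extract(input, output, *ranges)
  ["pdftk", input, "cat", *ranges.map(&:to_s), "output", output]
end

cmd = pdftk_extract("input.pdf", "excerpt.pdf", 5, "17-23", "50-59")
puts cmd.join(" ")
# prints: pdftk input.pdf cat 5 17-23 50-59 output excerpt.pdf
# To actually run it (requires pdftk to be installed): system(*cmd)
```

Building the command as an array and passing it to `system(*cmd)` avoids any shell-quoting surprises with file names.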
CHUCK: Very nice. I have to +1 the Brandon Hays thing for two reasons. One is that I remember talking to him on a semi-weekly basis when his whole goal was to get paid to write code. So, this was back when he was a marketing guy at a company and wanted to get into coding. And the other reason is he’s one of the most genuine people I know. And so, everything he writes is awesome. So anyway, I’ve got a couple of picks here. The first one is Vagrant. I know we’ve talked about it on the show before, but I got in and was doing some stuff with Vagrant to figure some stuff out for one of my clients. And it has been really, really, really nice. I’ve also been playing with Graylog2 and Kibana. And I really like those for visualization on log stuff. Still trying to figure out which one I want to use. But I’m liking both of those. So I’m going to pick both of those. And I’ll throw it over to Jacqui.
JACQUI: I just want to +1 Kibana. I’ve been really impressed with the Kibana dashboard. I haven’t worked with Graylog2 though. But okay, my picks. I thought if listeners were interested in hearing more about what we’re doing in our R&D Labs at the Times, I’d just quickly plug our website. We have a blog where we write up our current experiments and more finished products, and that’s at nytlabs.com. And if you’re interested in hearing from another organization about how they’re thinking of news coverage in terms of its production and breaking down stories into either their entities or their themes, you may have heard of the BBC. The BBC has a pretty fantastic series of R&D Labs, including the BBC News Labs. And one of their members wrote up this piece that I just found on Medium called ‘Storylines as Data’ at the BBC. It’s a pretty fascinating approach to breaking down all of the BBC’s news coverage. So, if you’re interested in hearing more about that, you should check it out. But my favorite thing that has happened on or related to the internet this year was a lightning talk at a conference I was at last month. It’s the annual conference for programmer journalists. It’s called NICAR. Again, it incorporates my favorite acronym, CAR. But basically a friend of mine, Jeremy Bowers, who’s a programmer journalist at NPR and a former debate team member and coach, I guess, going back from high school and through college, decided to try to explain the entire internet in five minutes. I think he ended up speaking at about 350 words per minute or something like that. But he goes into everything from the backbones of the internet to what TCP/IP is and how it works, and how that relates to the website. So, there’s a video of him giving this lightning talk. And then there are also his entire presentation slides, which involve plenty of kittens and things like that. And the conference was NICAR.
I think that’s the National Institute for Computer Assisted Reporting. It’s pretty entertaining though, to watch.
JAMES: That sounds awesome.
CHUCK: That does sound awesome.
JACQUI: Yeah. And if I have time for one more pick?
CHUCK: Go for it.
JAMES: Yeah. Go for it.
JACQUI: Okay. I don’t know if you guys are familiar with this project to make a film version of Dune, the Frank Herbert book, that never actually ended up getting made. But it was one of the most amazing premises for a movie I’ve ever heard. The director is Alejandro Jodorowsky. And he’s done a bunch of really weird movies. But back in, I guess it must have been the mid-70s, he decided he wanted to make a film version of Dune. This was before David Lynch ended up doing it. And there’s a documentary out now all about this project. He had cast Salvador Dali in it, and Pink Floyd was doing the music, and Giger and Moebius were doing all of the graphics. And they ended up coming up with this 200-page storyboard book that got sent out to all the studios. And you can see how it’s informed plenty of other movies that have since been made, even though that movie itself was never made. It’s pretty fascinating and weird. And I like weird. [Chuckles]
JAMES: That’s awesome.
JACQUI: Yeah. And those are my picks.
CHUCK: Awesome. Well, thanks for coming. It’s been a terrific discussion.
JACQUI: Oh, thank you for having me. It’s been a lot of fun.
CHUCK: Fascinating stuff going on over there.
JAMES: Yeah, it’s cool.
JACQUI: Thank you.
CHUCK: Quick reminder, our book club book is ‘Object Design’. We’ll be talking to Rebecca Wirfs-Brock toward the end of May. So, keep an ear out for that. Pick up the book. Go read it. We did look around and it looks like the best way to get it is on Safari Books Online. So, just be aware of that, because it is out of print. So anyway, I think that’s it. So, we’ll wrap up and we’ll catch you all next week.
[A special thanks to Honeybadger.io for sponsoring Ruby Rogues. They do exception monitoring, uptime, and performance metrics that are an active part of the Ruby community.]
[This episode is sponsored by Codeship. Codeship is a hosted continuous deployment service that just works. Set up continuous integration in a few steps and automatically deploy when all your tests have passed. Codeship has great support for a lot of languages and test frameworks. It integrates with GitHub and Bitbucket and lets you deploy cloud services like Heroku, AWS, Nodejitsu, Google App Engine, or your own servers. Start with their free plan. Setup only takes three minutes. Codeship, continuous deployment made simple.]
[Hosting and bandwidth provided by the Blue Box Group. Check them out at BlueBox.net.]
[Bandwidth for this segment is provided by CacheFly, the world’s fastest CDN. Deliver your content fast with CacheFly. Visit CacheFly.com to learn more.]
[Would you like to join a conversation with the Rogues and their guests? Want to support the show? We have a forum that allows you to join the conversation and support the show at the same time. You can sign up at RubyRogues.com/Parley.]