054 JSJ JavaScript Parsing, ASTs, and Language Grammar w/ David Herman and Ariya Hidayat


Use this link and code JAVAJAB to get 20% off your registration for FluentConf 2013!



00:48 - David Herman and Ariya Hidayat Introduction


Next Week

Web Developer Skills


JAMISON: I am Linus Torvalds and I pronounce Linux, Linix.

[Hosting and bandwidth provided by the Blue Box Group. Check them out at Bluebox.net.]

[This episode is sponsored by Component One, makers of Wijmo. If you need stunning UI elements or awesome graphs and charts, then go to Wijmo.com and check them out.]

CHUCK: Hey everybody and welcome to Episode 54 of the JavaScript Jabber Show. This week on our panel, we have Tim Caswell.

TIM: Hello.

CHUCK: Jamison Dance.

JAMISON: Hi guys.

CHUCK: Joe Eames.

JOE: Hey there.

CHUCK: Merrick Christensen.

MERRICK: Hey guys, what's up?

CHUCK: I'm Charles Max Wood from DevChat.tv. And we have two special guests this week. We have Dave Herman.

DAVID: Hey there.

CHUCK: Ariya Hidayat.

ARIYA: Hello everyone.

CHUCK: And these guys are so smart that we brought them back. So, if you're interested, we'll put links to the episodes that they were on. David was on when we talked about his book 'Essential JavaScript' and Ariya was on when we talked about PhantomJS.

JAMISON: Effective JavaScript.

CHUCK: Effective? What did I say?

MERRICK: Essential.

CHUCK: Essential? Well, it's an essential book on Effective JavaScript. How's that? [Laughter]

MERRICK: Good save.

DAVID: At least you didn't say Defective JavaScript. [Laughter]

CHUCK: No, that's what I write. I'm really good at writing defective JavaScript.

ARIYA: Actually, there's an essential book about defective JavaScript.

CHUCK: I also want to announce really quickly that Fluent Conf has given us a discount code. So, if you want to get 20% off on your registration for Fluent Conf, just enter JAVAJAB and you'll get 20% off when you register. Alright. Well, let's get started. This is going to be a really, really interesting topic and it's something that I've wanted to know more about for a long time. And I just haven't delved as deeply into it as I would like to.
And that is, we're talking about parsing JavaScript, ASTs, and language grammars. So, do one of you guys want to actually explain what all of that means?

DAVID: Ariya, do you want to give it a try?

ARIYA: I think you should go first, Dave.

DAVID: I should go first? Alright, let's see. We're all programmers. We've all seen what source code looks like. But of course, usually, source code is just represented as this big flat string, just a sequence of characters. So, parsing is kind of the first step necessary for a computer to be able to do anything with your code. It starts with this big string and somehow, it has to figure out, what does this string mean and what am I supposed to do with it? Before it starts actually trying to do anything with it, the first thing it needs to do is just try and figure out what the structure of the program is. Just figure out all of its component parts, and basically it converts it from a string into a data structure that's easier to write code over, that's easier for a program to deal with. Parsing is that process of taking a string that represents some sort of input format, some sort of language, and turning it into a data structure. Language grammar is the other half of the title of the show. That's a way of specifying the particular format of some syntax or another. Each language has its own format. And somehow, you need a way of saying, these are the possible things that you can say in this format and this is what they're supposed to look like. So, a grammar is a way of really precisely saying, this is exactly what the syntax looks like for this particular language. And then, it turns out that the two connect to each other, because grammars are so nice and precise that you can build tools that automatically take one of those grammars and just spit out a parser for you. So, you don't actually have to implement the parser by hand.
There's many, many, many years of research that went into making that work. But that's actually pretty stable research. I think it started in the '60s, it really started getting stable in the '70s, and by the '80s, it was seen as a pretty well-solved problem. So, grammars are this nice way of describing a syntax. Parsers are a nice way of recognizing a syntax and turning it into a data structure. And then, there's these things called parser generators that can take a grammar and give you a parser. Does that make sense?

JAMISON: I think that makes tons of sense. You talked about how to use a grammar to describe the syntax. Where do the semantics come from? A grammar is just saying what the legal strings in a programming language are, basically, right? So, what actually gives those strings meaning?

DAVID: Depending on who you ask. Well, in my personal opinion, that's where the real fun begins in programming languages: once you get past the syntax and start talking about what the meaning or the behavior of programs is supposed to be. I'm not sure how much we actually want to get into that in this particular episode, since we're talking about syntax. But semantics is what happens after you've got that data structure, which is known as an Abstract Syntax Tree. Once you've done the parsing, now the real fun begins, because you actually have to do something with this data structure. If you're talking about a language interpreter, it's supposed to run the program. Or if you're talking about a compiler, it's supposed to translate that program to another language. Semantics is the way that we specify what the correct behavior is, or what the correct translation would be for a compiler. And there's a ton of different ways of specifying semantics. And that's when the real deep, hairy language research kicks in.

CHUCK: We've mentioned it a few times, but can you explain what an Abstract Syntax Tree actually is?
DAVID: An Abstract Syntax Tree is that data structure that comes out the other end of a parser. It's known as abstract because you can think of the actual syntax that's in the string as being the concrete syntax for the language. That's really what it looks like to us as people, as humans. But the computer doesn't really care. Did you spell function F-U-N-C-T-I-O-N, or did you spell it F-U-N, or did you spell it lambda? Different languages have different ways of spelling these things. But you can think of those as just skins, sort of the user interface for the programmers. And humans are the only ones who really care about that. Abstractly, underneath, you just have this data structure that says, "Well, this is a function and it's got so many arguments and it's got a body," and stuff like that. Once you're down at that data structure, you don't really care how specifically it was spelled in the string.

ARIYA: The easiest way to understand the Abstract Syntax Tree of JavaScript code is by using the Esprima parser demo. I built this, and it shows exactly the tree of your code. If, for example, I have the code answer = 42, then it will say, "Hey, there's an assignment expression." And on the left side is an identifier called answer, and on the right side, it's a literal 42.

JOE: That's awesome.

ARIYA: I'm going to send a link later on for the parser demos.

DAVID: That's great. I think it's hard just from talking to get a good feel for what the difference is between concrete syntax and abstract syntax, but it really helps to look at examples. And Ariya's webpage is a really great way to play around with: okay, here's some concrete JavaScript syntax, and what does it look like as an Abstract Syntax Tree?

JAMISON: So, have the various JavaScript parsers pretty much standardized on this AST format for JavaScript? Or is that still...

ARIYA: They all have different formats.
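For reference, the tree Ariya's demo shows for `answer = 42` looks roughly like this. This is a hand-written sketch in the SpiderMonkey "Parser API" shape (the same format Esprima adopted), not captured parser output, so take the exact fields as illustrative:

```javascript
// Hand-written sketch of a Parser-API-style AST for: answer = 42
const ast = {
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'AssignmentExpression',
      operator: '=',
      left:  { type: 'Identifier', name: 'answer' }, // left side: an identifier
      right: { type: 'Literal', value: 42 }          // right side: a literal 42
    }
  }]
};
```

Tools that consume these trees walk the nested objects recursively, which is why a stable, engine-neutral shape matters.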
If you talk about a particular JavaScript engine, the format follows from how the engine itself is written, or how the parser portion of the engine is written. So, Dave came up with this idea of a nice AST that was implemented in SpiderMonkey as a reflection, I believe. Is that correct, Dave?

DAVID: Yeah. I basically took the parser that existed in SpiderMonkey and just provided a JavaScript API that spat out the results of it, which was kind of a reflection, so that you could reuse the SpiderMonkey parser in JavaScript itself: you could put in a string and get out an Abstract Syntax Tree.

CHUCK: And SpiderMonkey is the JavaScript engine for Mozilla?

DAVID: That's right. It's the one that ships in Firefox.

CHUCK: Are there other ways of seeing what the parser is doing in some of the other ones like Chrome, or Safari, or IE?

DAVID: I don't think so, unless you're willing to dive into a C++ debugger and peek under the hood. But I don't think any of the other engines expose their parser to JavaScript.

ARIYA: I think even in the case of SpiderMonkey, Reflect is not exposed to normal code, right?

DAVID: Well, it's perfectly safe. Basically, what it does is it runs the internal parser, but then it takes the internal C++ data structure that would be unsafe and copies it into a safe JavaScript object. So, there's no security risk there. But it actually is using the internal parser to parse the source code.

TIM: Is this exposed in Firefox to normal web pages, or do you have to have a special add-on?

DAVID: It's exposed to add-ons, it's exposed to the browser chrome, but it's not exposed to web content.

TIM: Okay. So, extensions can see it?

DAVID: Yeah.

ARIYA: So you can type Reflect.parse and then your JavaScript code, and it will spit out the tree.

TIM: But anyone can run Esprima, because that's just written in JavaScript?

ARIYA: Yes, that was the original idea.
I tried to implement the same parser API in WebKit, in JavaScriptCore, and it turned out to be quite hairy, because trying to match one JavaScript engine's parser format to another is not so easy. That was a difficult process.

MERRICK: So, can you guys explain? You touched a bit on the parser piece, and I usually hear another word when that comes along, which is lexer. Does the lexer typically come before the parser, and can you guys touch on what that is?

DAVID: Yeah, I have an answer to that. Most programming languages will typically divide up this process that goes from the source string into an AST into two parts. And like you said, the first one is the lexer. The lexer is kind of like a mini parser. It's doing less work than the parser is, but what it's doing is taking the string of characters and recognizing little component chunks. As an analogy in human language, you can think of it as the thing that tells you what the words are, and then the parser tells you how the words come together to build sentences. So, the lexer is going to say things like: you wrote obj.foo, and that's really three of what they call tokens, which you can think of sort of like words. The first one is obj, which is an identifier; the second one is dot, which is a special symbol in JavaScript that has a special meaning by itself; and the third one is another identifier, foo. So, the lexer is just going to say, "I see three tokens here: obj, dot, and foo." And then the parser is going to say, "I have identifier, dot, identifier; that actually represents a property selection on my object." And so, it will build an AST node, an Abstract Syntax Tree node, that represents that structure. It's actually possible for some languages not to have a lexer, but it's rare. Usually, that's a good way to divide up the work.
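The obj.foo example Dave walks through can be sketched as a toy lexer in a few lines. This is a deliberate simplification for illustration; a real JavaScript lexer also handles numbers, strings, regular expressions, comments, and dozens of punctuators:

```javascript
// Toy lexer: recognizes only identifiers and the "." punctuator.
function lex(src) {
  const tokens = [];
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (/\s/.test(ch)) { i++; continue; }            // skip whitespace
    if (ch === '.') {                                // single-character punctuator
      tokens.push({ type: 'Punctuator', value: '.' });
      i++;
    } else if (/[A-Za-z_$]/.test(ch)) {              // start of an identifier
      let j = i;
      while (j < src.length && /[A-Za-z0-9_$]/.test(src[j])) j++;
      tokens.push({ type: 'Identifier', value: src.slice(i, j) });
      i = j;
    } else {
      throw new SyntaxError('Unexpected character: ' + ch);
    }
  }
  return tokens;
}
```

Running `lex('obj.foo')` yields exactly the three tokens described above: an identifier `obj`, a punctuator `.`, and an identifier `foo`. The parser's job then starts from that token list, not from raw characters.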
CHUCK: So, let's say that I want to write my own language, and I'm going to be real creative and I'm going to call it Blavascript, and anyway... [Laughter]

JAMISON: From Transylvania.

CHUCK: Yeah. How would I go about that? Would I just define my own grammar and then use that to build the parser?

TIM: So, I can share how I wrote the CoffeeScript parser, if that's useful.

CHUCK: Okay.

TIM: I was working with Jeremy years ago, and he had an existing CoffeeScript parser written in Ruby using some parser generator in Ruby. And the CoffeeScript syntax is actually rather, I think it's context sensitive. I don't think it's possible to properly describe it using one of the formal grammars. And so, what we did is we hand wrote a lexer using regular expressions that just broke up the string into various tokens.

JAMISON: You should define context sensitive.

TIM: A type... I don't know. We'll get into that later. [Laughter] But there are many different classes of grammars. I mean, you have regular expressions, which are actually a family of grammars. And I'm sure Dave can explain it much better than I can. But the context sensitive ones are much harder to do. I don't know of any parser generator that supports those, at least not efficiently.

ARIYA: I think a loose analogy would be a human language, where different words can have different meanings; even a full, complete sentence under a different context can give you a totally opposite meaning.

JAMISON: It's kind of like you can't just read the word in the abstract. You have to read it in the context of the sentence to know what it means.

TIM: Right. And all parsers need some context, like, what does this dot token even mean? Well, in JavaScript, dot's actually not that ambiguous, but there are other ones.

DAVID: You could put a dot inside of a string literal, for example, and that's a different context where dot means something very different.
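The point about context shows up even with a single character. In the lines below (variable names chosen just for illustration), the same dot plays three different roles, and the lexer has to know which context it is in:

```javascript
var obj = { foo: 1 };

var s = "obj.foo";   // inside a string literal, "." is just a character
var n = 3.14;        // inside a numeric literal, "." is the decimal point
var p = obj.foo;     // between identifiers, "." is the property-access operator
```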
TIM: That's true, but the lexer would probably catch those.

DAVID: Right.

ARIYA: Or decimal numbers.

TIM: There are various ones, like the square brackets, for example. When I was designing my Jack language -- there are only so many symbols on the keyboard, so JavaScript overloads them, and most languages overload them. Square brackets can be property access or they can be an array literal. And what they mean depends on how they're used in the syntax.

ARIYA: The classic example in JavaScript is the forward slash, because it can start a regular expression, but in some cases it's actually an operator.

TIM: Yeah, that's a fun one. And that would be what I would try to get the lexer to do, but then, I don't know, that's hard, because then you have to backtrack in your lexer. Certain languages are harder to write parsers for. And CoffeeScript is one of those harder ones. So, when you're designing your language, if you have the freedom to design the syntax and you want to make your life easy, design a syntax that's not ambiguous. But those aren't always the easiest languages to use.

ARIYA: Indeed.

DAVID: So, this is one of the classic tradeoffs. It gets back to, I think it was Charles who asked the question: if I wanted to design my own language, how would I go about coming up with a syntax and coming up with a parser? That's actually a really subtle design question. And one of the tradeoffs that Tim is getting at is, you can design the grammar of the language to be simpler for algorithms to parse, or you can design it so that it's more pleasant for humans to look at and easier for them to read. And it turns out that humans are a lot messier than computers. So, our tastes often run in directions that go completely at odds with what algorithms would really like to see.
And so, overloading is a great example of where humans are really comfortable with brackets meaning different things in different contexts, but that makes the algorithm much trickier, because now it has to use additional contextual information to figure out exactly what it's looking at at various points. And so, some languages have really gone hog wild with context sensitivity. One of the great examples is the C programming language, where you can't even tell what basic kind of token you're looking at without knowing the full scope of the program and all of the type definitions in the program, basically due to really fiddly details in the way they designed the grammar. So, it has tradeoffs. On the one hand, people like looking at the C syntax; they find that ambiguity not a problem for themselves, and it reads nicely for them. On the other hand, because it's so much harder to write the tools, it's actually pretty uncommon for people to go and write a new C parser, because it's so incredibly hard to write one. On top of that, all of these parser generators we were talking about, which can take a grammar and automatically spit out a parser so you don't have to write it by hand: it's pretty much impossible for any of them to generate a C parser automatically. So, you end up having to write it by hand. So in practice, all of the C parsers are handwritten. It's pretty similar with JavaScript: because of that ambiguity with the slash character, it's pretty hard to use a parser generator to automatically give you a JavaScript parser. So, people end up having to write it by hand, which means that it takes really smart people like Ariya to actually build a JavaScript parser, and they don't come by every day. So, it's a tough tradeoff.

CHUCK: So this leads me to two things. First off, you said that humans are messier than computers. And I have four kids and three computers, and I have to agree.
[Laughter]

CHUCK: The second thing is: is it an all or nothing proposition to use a parser generator? Or can you use it to get you most of the way there and then deal with the ambiguities on a one by one basis?

TIM: So, that's what I was about to talk about with CoffeeScript. You can't actually do the grammar using Jison, which is the parser generator we use, because it's just not possible. So Jeremy's idea was, he added a layer between the lexer and the parser that rearranges the tokens, that gets rid of the ambiguities. Kind of massages it manually.

ARIYA: That was also my first approach to building a JavaScript parser in JavaScript, meaning using a hybrid approach. Some of the parser is generated by a parser generator, and then I tried to work around some problems by manually writing some code. But it turns out with JavaScript, you end up writing so much code that it doesn't really make sense to keep the generated parser from the parser generator anymore. And that's how I ended up doing it with Esprima. So in some cases, if you describe a grammar for, say, a while statement, the grammar for the while statement looks pretty clear. And if you go through the code, sometimes the code is messier. But at some point, you realize that just by looking at how while statement parsing is implemented, it's not much more difficult. And therefore, you just dump the grammar, and there's no need to feed the parser generator to generate the parser anymore, because if you code it carefully, the parser code is obvious, is understandable enough. I don't know if I'm making sense there.

MERRICK: Yeah, it makes sense.

DAVID: Basically, what you're saying is that because the grammar is a pretty clear specification, it's not that hard to just follow it by hand and write the code yourself.

ARIYA: Right.

DAVID: And another thing is, people often find that when they write it themselves, they can start to hand optimize it.
And parsers tend to be one of those things that don't change very often, because you don't change the syntax. So, once you get it working, you can go and start optimizing it to try to get better performance out of it, and not have to worry about your optimizations needing to be undone in the future, because you're not going to be changing the syntax anyway. So, with production quality JavaScript parsers, for example: I believe historically there were some production JavaScript engines that used parser generators originally. But then, as performance became more and more competitive in JavaScript engines, they pretty much all ended up handwriting their parsers so that they could hand optimize them.

ARIYA: An example would be JavaScriptCore, which is the JavaScript engine in WebKit. It was originally derived from KJS, from KDE. And the KJS grammar is described for Bison, I believe. And then it is generated to C, and then we have to bind it to C++. So, there's a whole bunch of layers there, and you still need to implement some native C++ code to work around some parsing problems. Semicolons, for example: automatic semicolon insertion requires workarounds almost everywhere. It just doesn't make sense to keep the old generated code.

CHUCK: So, that kind of leads into the next question, and that is the specification changes for JavaScript. Some of these are just, you're essentially adding new tokens or adding new functions to the list of things that JavaScript can do. And in some cases, like the semicolon insertion stuff, it kind of touches everything. So, what are the tradeoffs? I mean, if you have a parser that you've been able to just generate and then they introduce some kind of breaking change, are you back to square one or square zero?

DAVID: Well, the bottom line, the number one ground rule, is you're not allowed to make breaking changes, except in extreme corner cases where you think no code will actually be affected by it.
This is one of the really tricky things about migrating a syntax or evolving a syntax: it's really easy to accidentally add something that breaks the rest of the language. It could either break backwards compatibility, or it could introduce an ambiguity, where there might be two possible ways to understand some piece of syntax. And that's one of the worst kinds of bugs when you're designing a syntax, because now it's totally unclear what the behavior of the program should be, because the syntax could actually mean two totally separate things. So, one of the nice things about parser generators is, even if you don't use them in production, if you run a grammar through a parser generator and it succeeds at providing you with a parser on the other end -- unless it's some strange esoteric parser generator that can handle ambiguous grammars, which are rare; most parser generators will fail if there's an ambiguity -- that's actually a way to tell, "Okay, we definitely have an unambiguous grammar." There's no possible string of characters that could be interpreted in two contradictory ways. But it's really surprisingly tricky. And ASI is one of the worst ones. Automatic semicolon insertion is one of the worst ones. Actually, it's probably a good moment for a public service announcement. When you design your next programming language, please don't use automatic semicolon insertion. It's just more trouble than it's worth. [Laughter]

JAMISON: Don't worry, I'm not.

DAVID: It's going to lead to a really difficult language to parse. But on top of that, it's going to lead to endless Twitter wars between people arguing about whether they should use it or not. So, just don't bother. [Crosstalk]

TIM: I think Go-lang has that, but you're not supposed to put them in your actual written code. It just adds them for you when it compiles. There's something weird about it with Go-lang.

DAVID: Go?

TIM: Yeah. Don't they?

DAVID: Go actually has similar problems.
They chose different policies than JavaScript. I could be wrong about this; I'm not an expert at Go. But from what I understand, they actually have some similar kinds of hazards to JavaScript. Like in JavaScript, if you see a return and then an expression, but you put a new line in between them -- famous hazard -- you think that's a return with an argument, but it actually gets a semicolon inserted. I think in Go -- people will have to correct me after the show if I got this wrong -- if you do a function call where the function returns another function and you want to call that again, so like f()(), if you put that second pair of parens on the next line, I think it actually inserts a semicolon between them. So, that's different from JavaScript, which doesn't insert a semicolon there. But it's a similar hazard. So, in my opinion, semicolon insertion ends up being too hazardous. I think you can introduce syntaxes that are not based on semicolons but still have these kinds of ambiguities. It's not exactly an ambiguity, but it's a human confusion, an easy confusion for humans. But doing it through automatic insertion of semicolons just seems like it's always more trouble than it's worth. [Crosstalk]

ARIYA: I can't always figure out the automatic semicolon insertion rules. So, I always joke around that it's called semi-automatic semicolon insertion. [Laughter]

DAVID: It's a semi-automatic weapon.

CHUCK: That's right. You can pull the trigger as many times as you want, and it will keep shooting until you're out of bullets. [Laughter]

ARIYA: What do you think it's doing? Talking about syntax changes again: Dave mentioned that sometimes you don't need to tweak the syntax very much, or you can just pass it to the parser generator. But there are also syntax changes that will cause major disasters. For example, one of the nice things about C is that you cannot have nested functions.
There cannot be functions inside of functions. So, if the next person maintaining the C language says, "Let's change the syntax so you can have nested functions," that would be just a nightmare.

TIM: What are some of the problems that it's going to cause for C syntactically?

ARIYA: Well, you need to understand and rewrite all the parsers again, because those do not understand that yes, you can have a function inside a function.

DAVID: Oh! So, it's just going to be a lot of work to update all those parsers? The worst case scenario is when you introduce something where it actually causes an ambiguity. And that's the one where you might not even notice. And I've seen it happen, where we tried to add syntaxes to the ECMAScript standard, and it took a while for somebody to even notice, "Oops! That introduces an ambiguity."

JAMISON: Oh, there's not a way that you can automatically detect that?

DAVID: That is one of those fun computer science questions. The general question of "give me any grammar whatsoever and tell me if it's ambiguous" -- it's actually mathematically impossible to build an algorithm that can give you that answer in finite time.

JAMISON: Does it reduce to the halting problem or something?

DAVID: Yeah, through a series of really complicated proofs. Essentially, I think it went through this classic problem called the Post correspondence problem. I'm forgetting it now, but PCP was the shorthand for it. I think that was the way it was proved, and then that can be reduced to the halting problem. Anyway, long story short, you can't write a general purpose algorithm for any grammar, but for certain subclasses of grammars, you can. So, that's why a parser generator works for a certain subclass of grammars. And there's all these different families that Tim alluded to earlier, like LL grammars, LR grammars, LALR grammars -- all these weird technical names for different sub-families of grammars.
For each one of those, you actually can produce an algorithm that can prove whether a grammar is ambiguous or not. So typically, the real parser generators can do that for you, because they only operate on these subsets.

JAMISON: So, how much of language design is limited by the language designers trying to make their lives easier? Because I look at it totally from the outside, like: I just want this feature, and I have no idea what it takes or how much work it will be; it'll just be really cool. Does that happen a lot? Do you just reject things because they'll be really hard to implement, even if they'd be a cool or useful feature?

TIM: I've done that a few times with the new language I'm designing. I have this constraint that it must be an LALR grammar, or whatever the default in Jison is, so that I can always use the generator to check my grammar and make sure that it's not ambiguous. There have been many features that I wanted to add where the grammar is like, "No, don't do that." So I move things around in the grammar definition and I cannot find a way to make it unambiguous. And so, I just don't have a lot of those features in the language. One that I've been having trouble with lately is argument destructuring, which is a really cool feature. But it's kind of tricky to do.

DAVID: Yeah. We've run into some of these things with the Rust programming language, where again, we were trying to keep it so that the syntax doesn't get too complicated, so that it's easy to build tools for Rust. You have to remember when you're designing a language that you serve many, many different customers. And certainly, yourself as the implementer is one customer. But simplicity of implementation also affects tool writers, and those tool writers may not be yourself. And the easier you make things for tool writers, the easier you ultimately make things for developers, because developers are going to use these tools too.
So, even though there might be something that has an immediate benefit for a developer because, as a programming feature, it's nice, if it hurts the ability of tool writers to write tools, that could ultimately, indirectly, in the longer run, hurt developers too. So, there's a lot of constraints you have to keep in mind at the same time. On the other hand, I tend to still be willing to violate even my own ground rules that I set for myself. Like, okay, we want to make it easier for tool writers, but, oh man! This syntax is just too nice. We'll make an exception here and we'll make it a little harder for the tool writers. They'll still cope. Tool writers are smart people. They'll figure it out.

JAMISON: I guess it's like an opportunity cost, right? Like, if the syntax is nice enough, then more people will use the language, and then you'll have more tool writers and things like that. [Crosstalk]

DAVID: But you're also predicting what people are going to do in ten years. Like, when do we have a language that actually takes the world by storm and a million people write tools for it? That's so far into the future, it's actually hard to predict what it's going to look like.

CHUCK: I want to go back to something that we've kind of talked about, but didn't really talk about as far as what it actually means. We talked about what a grammar is. But how is a grammar actually defined?

DAVID: I'm glad you asked this, because this is something that I think is really a great general purpose tool for people to learn. And the analogy I want to make is to regular expressions. Regular expressions are just this weird little language that people learn. The first time you look at a regular expression you think, "Okay, computer science people have really lost their minds. This is the weirdest, most arcane little thing I've ever seen." People talk about it as write once, read never.
And it is a really weird little language, but it's so powerful. And it's so concise that people use it constantly, to the point where they just write little grunt command lines and do little one-off regexes, because they're just so powerful. And so, it's actually worth it to put in the effort to learn regular expressions. But it turns out that regular expressions are nowhere near as powerful as grammars. Grammars are the next step up in power for being able to describe a class of syntaxes. Regexes and grammars are really doing the same basic thing. They're saying, "Here's what a particular syntax should look like." Then you run a recognizer to actually parse it. The thing is that the class of syntaxes that a grammar can recognize is far, far more expressive than the class of things that a regular expression can. So, it's worth it to learn the notation of grammars, because it's just so much more expressive. And once you learn that, you can use it for all sorts of things. For example, if you want to create a new configuration file format for your application, or you want to describe some network protocol where you're sending things encoded in strings over the Internet. There are all sorts of things where people need to describe a syntax format and regex is not powerful enough to do the job. The other cool thing about grammars is that they're still less powerful than just writing a regular program. And so, they're this nice sweet spot in between. So, the notation is known as BNF. It stands for Backus-Naur Form, named after two famous computer scientists. And taking the time to learn BNF, I think, is a really valuable skill for programmers.

JOE: And how does one learn these [inaudible] secrets of which you speak?

DAVID: [Chuckles] That's a good question. I think a lot of BNF is covered in -- probably most compiler textbooks will cover BNF.
You can probably get a lot of material just from Googling it because it’s such a common tool in Computer Science that there’s probably a ton of literature online. The other thing is that BNF itself is almost like a tiny little programming language. For the most part, it’s a really simple little programming language. You basically define a bunch of functions and they can call each other. That’s really all there is to it. The one thing that’s conceptually tricky, that you need to get your head around before you get to that point, is to be able to think recursively. And that’s a skill a lot of people don’t have and aren’t born with. It certainly took me a whole bunch of training before I felt comfortable with recursion. And BNF grammars are all defined using recursion. So for that, I’ll just throw out one textbook that I particularly love. It’s not in JavaScript so you’ll have to be willing to play with an esoteric, parenthesis-heavy language, Racket, which is kind of a Lisp-like language, or Scheme-like language. But the book is called ‘How to Design Programs’. And their website is HTDP.org. The whole text of the textbook is freely available online. That’s a great introduction to thinking recursively. And I think it’s a great precursor to being able to understand BNF. [Crosstalk]**DAVID: **Oh, was it? I guess I’m too obsessed with it. [Crosstalk]**DAVID:  So, it could be overkill. Like if you feel that you already understand recursive functions, you actually shouldn’t have too much of a hard time being able to write BNF and I would just Google it. ARIYA:  The classic example for writing BNF is math expressions. So, it has to understand different operators and their precedence. And you can try to parse simple mathematical expressions. That’s how it goes. CHUCK:  My understanding is that you define expressions and then you tell it how to recognize the expressions based on the tokens that make them up in a grammar. ARIYA:  That’s pretty much it.
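The arithmetic-expression example Ariya mentions can be sketched in BNF-style notation. The nonterminal names here are illustrative, and the two-level split is one common way to encode that `*` binds tighter than `+`:

```
<expression> ::= <term> | <expression> "+" <term>
<term>       ::= <factor> | <term> "*" <factor>
<factor>     ::= <number> | "(" <expression> ")"
```

Each rule is defined in terms of the others, which is exactly the recursion Dave describes: an expression contains terms, a term contains factors, and a factor can contain a whole expression again inside parentheses.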
CHUCK:  Then yeah, basically what it does is it says, “Okay, I recognize this expression.” And so then, it takes the arguments to that expression and it breaks those expressions up into their tokens. And that’s how you get the Abstract Syntax Tree of your program. DAVID:  Right. And that’s also where the recursion shows up because to define what an expression looks like, well, an expression could be an addition expression which has a plus in it but it also has two other expressions. So, the definition of an expression includes parsing another expression or two other expressions. CHUCK:  Right. And the expression can be a single token like a number or it can be another expression which is ten times ten. DAVID:  Right, exactly. TIM:   That part’s easy. The hard part is sticking to the rules of whatever subset you’re using. Some of them don’t allow left recursion and some of them, if you don’t have an operator precedence for infix operators, then it blows up. And so, intuitively, a BNF is very easy to read, but writing one is a little harder. DAVID:  That’s true because the general rules of BNF actually don’t prevent you from creating ambiguous grammars. And so, there are all these subsets of the possible grammars that prevent you from ever writing an ambiguous grammar. But understanding the rules of how to stay within that subset is much trickier. That definitely takes some practice and it takes playing with the tools. This is actually something I mentioned before. The research was kind of seen as done by the ‘80s or ‘90s. One of the things, I think, they didn’t work on back then was really making the tools easier to use. So, when it blew up, it would give you these incomprehensible error messages. And the state of the art in that hasn’t gotten a whole lot better.
And that’s something I would love to see people do more work on: okay, we understand how parsers work, but can we actually have parser generators that help you debug the problem when you’ve made a mistake in your grammar? TIM: **The error messages you get in Jison, the one that CoffeeScript and my language use, they’re useful. But if you don’t understand the table that is built internally, they’re utterly useless. It’s like, in state 47, I have this shift-reduce conflict. I’m like, “Okay. And that means what?” [Laughter]**ARIYA: **That was my experience as well when I switched from a parser generator to a [inaudible] parser for Esprima. I found the debugging is much simpler because you can just insert the debugger somewhere and then watch the call stack and finally understand what’s wrong there. As opposed to some cryptic message and there’s no call stack at all.**TIM:  So, tell me about Recursive Descent Parsers. I have not yet figured these out. My plan is to write one after my grammar is stable. How do they work? ARIYA:  I think Dave, do you want to explain that? DAVID:  Sure, I could give it a try. Actually, I think Recursive Descent Parsers, even though the name sounds scary, are, to my mind, sort of the first way you would think to implement a parser. And they sort of follow the structure of the grammar most naturally. So, if your grammar says that an expression can be either a number or an expression followed by the plus sign, followed by an expression, then you write your Recursive Descent Parser that just says, “Okay, let’s look at the token. If it’s a number, we’re done, return a literal node. Otherwise, just recursively call the expression parser, then look for the plus token, then recursively call the expression parser. Now, we’ve got three nodes, put them together.” And you’re done. So, it’s just sort of like a recursive function that does what the grammar says to do.
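Dave's description of a recursive descent parser can be sketched in a few lines of JavaScript. This is a minimal illustration of the pattern (numbers and `+` only, parsed right-associatively), not Esprima's or any real engine's code:

```javascript
// Minimal recursive-descent parser for expressions like "1 + 2 + 3".
// A sketch of the pattern Dave describes: the parse function mirrors
// the grammar rule "expression := number | number '+' expression".
function parse(source) {
  // A trivial tokenizer: numbers and plus signs only.
  const tokens = source.match(/\d+|\+/g) || [];
  let pos = 0;

  function parseExpression() {
    // Base case: a number becomes a literal node.
    let node = { type: 'Literal', value: Number(tokens[pos++]) };
    // Recursive case: a '+' means another expression follows.
    while (tokens[pos] === '+') {
      pos++; // consume the '+'
      const right = parseExpression();
      node = { type: 'BinaryExpression', operator: '+', left: node, right };
    }
    return node;
  }

  return parseExpression();
}
```

Calling `parse('1 + 2')` produces a `BinaryExpression` node whose `left` and `right` are `Literal` nodes, which is the "three nodes, put them together" step in Dave's description.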
ARIYA:  My favorite example is the while statement in JavaScript because the while statement is pretty simple. You have the test and the body. And the construct itself is: you expect the keyword while, and then it is followed by an open bracket and then the expression, close bracket and then the statement. So in your parser, you simply write: expect the keyword, expect this bracket, and then parse the expression. That’s where it goes recursively to the other function. And then, after that, you close it and then go recursive again parsing the statement. That’s all there is to it for while statements. DAVID: **So, Recursive Descent, kind of by itself, is, I think, the cleanest and most natural way to write a parser following the grammar that specifies what it’s supposed to look like. The couple of places where it can get tricky are, first of all, when you deal with infix operators. Dealing with the precedence of those operators is tricky. So you have to know when to keep looking for more arguments and when to return. So, that part’s a little bit tricky but you can kind of follow your nose and figure it out. It’s actually a fun problem to work on. It’s a solved problem, but it’s one of those where it’s fun to say, “I’m not going to look up the answer. I’m going to see if I can write this program myself.” [Crosstalk]**DAVID: **Yeah. He has a paper on that which is basically just showing, “Here’s one way that you can write that algorithm that was published in the ‘70s. Here’s one way you could write that,” like if I was writing a parser in JavaScript that was doing precedence parsing. I mean, that’s one way to do it. I actually found that code kind of hard to follow. So, precedence parsing can be one challenge. The other challenge is if your language doesn’t support recursion very well, then you can blow the stack if you have too deep of a nested program that you’re parsing.
So then, actually taking your recursive program and turning it into one that’s iterative, that only uses loops and builds its own stacks, is kind of a royal pain. For a simple Recursive Descent Parser, you can just say, “If the input program gets too big, we’ll just blow up,” and then you don’t have to worry about it. But if you start doing things like -- I think we ran into this with [inaudible] where we wrote it in JavaScript which doesn’t have very deep stacks and [inaudible] these days is parsing million-plus-line programs. And so, there were times when it just fell over because of the size of the input.**CHUCK: **Nice. [Chuckles]**TIM: **I guess I just never thought a program would be that big. [Laughter]**DAVID:  Well, when you think of a human writing it, you don’t think of it being that big. But when you think of it as, I’ve got to parse any program that comes in, it could even be one that was generated by somebody else’s code, suddenly you could imagine it getting a whole lot bigger. ARIYA: **You could also definitely use a hybrid approach, in Esprima, for example. For all function blocks, I use recursive descent. But as soon as I hit an expression like binary and unary, then it switches to a simple stack-based parser, shift and unshift, that kind of thing. [Crosstalk]**ARIYA: **That’s the pattern that every JavaScript engine uses, I believe. Even SpiderMonkey switched to that [inaudible] just like in V8 and JavaScriptCore.**DAVID:  And that makes sense because you don’t imagine that you’re going to have so many deeply nested functions. I mean, let’s say your stack blows out at 3000, do you imagine that you’re going to have 3000 deeply nested functions? But you can imagine a really, really big operator expression, particularly if it was one that was generated by some other compiler. Those could actually get deep quicker.
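The stack-based approach Ariya describes for binary expressions can be sketched like this. It's an illustration of the general technique (explicit operand and operator stacks, so a long operator chain never deepens the call stack), not Esprima's actual implementation, and the precedence table here is deliberately tiny:

```javascript
// Iterative, stack-based parsing of binary expressions: instead of
// recursing per operator, keep explicit value and operator stacks.
const PRECEDENCE = { '+': 1, '-': 1, '*': 2, '/': 2 };

function parseBinary(tokens) {
  const values = [];    // operand stack: finished AST nodes
  const operators = []; // operator stack: pending operator tokens

  // Pop one operator and two operands, push the combined node.
  const reduce = () => {
    const right = values.pop();
    const left = values.pop();
    values.push({
      type: 'BinaryExpression',
      operator: operators.pop(),
      left,
      right,
    });
  };

  for (const tok of tokens) {
    if (tok in PRECEDENCE) {
      // Reduce while the stacked operator binds at least as tightly,
      // which gives left-associativity and correct precedence.
      while (
        operators.length &&
        PRECEDENCE[operators[operators.length - 1]] >= PRECEDENCE[tok]
      ) {
        reduce();
      }
      operators.push(tok);
    } else {
      values.push({ type: 'Literal', value: Number(tok) });
    }
  }
  while (operators.length) reduce();
  return values[0];
}
```

With `parseBinary(['1', '+', '2', '*', '3'])`, the `*` is reduced first, so the root node is the `+` with the `2 * 3` subtree on its right. However deep the operator chain, only the two arrays grow, never the call stack.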
JAMISON:  I want to talk about this from the perspective of the humble JavaScript developer, someone who isn’t a language implementer. I mean, Tim and Ariya and David are all experienced with language internals. But what does this stuff mean for someone who’s just writing applications? Is this useful just in a general, ‘the more you know, the better off you’ll be’ kind of way, or are there some practical benefits that you can get from knowing more about this stuff? ARIYA:  **As Dave alluded to earlier, I always see that the benefit is more on the tooling side. Yes, the syntax tree is fun, but the real fun is what you do with it. So, the classic example is you can build [inaudible] analysis like Jasmine/Jasmine. Or you can build [inaudible], one known example in this case. So, it’s really about what kind of tools people can build as soon as we get parsers and grammars and syntax trees.**DAVID:  But I think another place that’s good to think about using grammars and parsers is that not every syntax is actually a full-fledged programming language. The example I mentioned before is a config file format. And maybe you’ll just reuse an existing config format like JSON, or JSON plus comments, which we all wish JSON had. But little things like that crop up in programs a lot, especially if you’re doing anything with external storage. You’re saving some files on the file system, or network protocols where you’re communicating between different things. And again, you can always reuse existing formats. And then, if you do it in JSON, you can just reuse JSON.parse. But it does come up where we need to write this and maybe JSON just isn’t the right syntax for the people who are writing this file format. It needs to be handwritten and it needs to be in a format that we find easier to read and write. Or maybe JSON is too verbose. And there’s some more concise syntax we can come up with.
So, I think these things do come up and it’s a useful skill to have. And the other thing is -- I don’t know if people have heard that famous Jamie Zawinski quote: some people look at a problem and think, “I know, I’ll use a Regex.” And now, they have two problems. The inspiration for that quote is really that when a Regex is the only tool you have, you start to think that it will solve everything. And there are some things that Regex just isn’t powerful enough to solve. And the next step up from Regex is grammars. So, it’s good to know what the next step of power up from Regex is, so you can recognize, “In this instance, a Regex is not the tool for the job. It’s not the one that’s going to solve my problem.” JAMISON: **That’s your answer to all those questions about parsing HTML with Regexes? [Laughter]**DAVID:   Yeah. JAMISON:  Not a regular language. MERRICK: **The other thing, as a humble JavaScript developer, is I don’t come from a Computer Science background. So, these kinds of things enable the templating languages like Handlebars, where you can even add kind of your own functions into these DSLs that we use on the web every day, like Angular. You can hook in and make your own directives, but these always hook in at the compiler and they end up being a lot more expressive. So, as a humble JavaScript developer, you’re using these kinds of things all the time and understanding them better means you’re going to be able to use them better. [Crosstalk]**DAVID:  Sorry. I’ll just give this example really quick. Another example is jQuery where you basically have something a lot like CSS embedded in a string. And there are definitely arguments for and against doing that kind of thing, inventing your own DSLs and embedding them in strings. But I think jQuery obviously has gotten a huge amount of mileage out of that. And they never could have done that if they didn’t have somebody who could write a parser for that embedded language.
MERRICK:  And you’re referencing, for the humble JavaScript guy, you’re referencing what’s passed to the jQuery object, like the HTML string or a CSS selector, right? DAVID: **Right, exactly. So, you say $ of the string “.foo .bar”, or “.foo #bar”, whatever. That’s the [inaudible] language that has a syntax, and somebody had to write the parser to parse that string to be able to make jQuery work. And if you want to write the next jQuery, well, you’re going to have to write a parser.**JOE:  Yeah. I’ve had a couple of situations in my career come up where we implemented text searching. We used Lucene to do that. And that has all this stuff around whatever you’re indexing, the text you’re indexing; you can get better information out of it with a custom lexer and parser, using the tools that it provides and some other tools that are available. So, it really made a big difference to understand those concepts. And to be honest, at the time, I didn’t understand it very well. I got some introduction to them. And also, you can solve some business problems by writing a DSL that isn’t necessarily for programmers. I’ve done that before as well. TIM:  One of the things I learned from the Lua community is they have a library called LPeg which is basically, I guess, a parser generator, or maybe it’s an interpreter. But you basically just give it grammars and then feed it strings. And it’s like a more powerful Regular Expression engine. But they’re not regular, they’re full grammars. And you get the tree as your match. I always wish I had something like that in JavaScript readily available, where I could just write a small grammar and match against it the same way that I can just write a Regular Expression to match against it. DAVID:  You should totally write that. TIM: **I’ll do a Kickstarter for it. How’s that? [Laughter]**CHUCK:  Are there any other aspects of this that we want to cover before we get to the picks?
DAVID: **I guess one thing I also wanted to point out is that an AST is kind of like this key to a massively bigger kingdom of things you can do in your program. It’s like, if I hand you the source code to some JavaScript, well, you’ve got eval so you can pass it to eval. But other than that, there’s only so much you can do with Regex. As soon as you can actually turn that into an AST, there are all sorts of interesting things you can do with that data structure. But on the other hand, it’s good to have a healthy respect for just how complicated that space is. You could write an interpreter for JavaScript with that AST, you could write a static analysis engine for JavaScript with that AST, you could write a compiler to compile that JavaScript to something else. Every one of those topics is this big, huge topic that’s really complicated and incredibly fun and worth learning about. But also, I’ve seen some people who kind of get their hands on an AST and they write their first static analysis and it’s totally broken. [Chuckles] So, it’s good to recognize that these are complicated areas and fun topics to learn about, but also very big topics.**MERRICK:  They’re very powerful topics. For example, one thing I had to do is write something that lets you basically pre-compile Handlebars templates but also be able to dynamically pull in partials, et cetera, from other files using AMD. The only way I was able to do it was to get access to the Handlebars Abstract Syntax Tree and figure out what the partials were, come up with a path, and pull them in. So, I’m not a programming language expert by any means, but I was able to take advantage of this Abstract Syntax Tree and add a whole other level of power to our applications. ARIYA: **That’s what [inaudible] the augmenting the existing syntax, all the features.**CHUCK:  Nice. Alright. Well, I’m going to wrap this up and get us into the picks. Merrick, do you want to start us off with the picks?
MERRICK:  I’m just going to pick these guys that we were on the episode with because this is one of those episodes that’s incredibly humbling. And being on the show with these guys is pretty awesome. So, that’s all I got. CHUCK:  Awesome. Joe, what are your picks? JOE: **So a little bit ago, Origin had a sale and had Mass Effect 3 on for $10. I hadn’t bought it and played it yet with [inaudible]. I’m going to pick Mass Effect 3. I’ve been playing it recently and I really enjoy it. Also, there’s a Coursera course that just launched this week. So by the time this gets published, it will be like two weeks ago. I think you’d still be able to sign up for it. It’s called ‘A Beginner’s Guide to Irrational Behavior’ and it’s done by Dan Ariely. I think that’s how he pronounces his last name. He’s the guy who wrote Predictably Irrational, which is a very popular book about human behavior and how irrational we are, although predictably so. It’s a whole Coursera course about it. I don’t know if it’s four or six weeks, all about human behavior; they call it behavioral economics or something. I can’t remember what the scientific term for it is. Anyway, it’s really, really interesting stuff. So, I’m going to pick those two.**CHUCK:  Alright. Tim, what are your picks? TIM:  I don’t know. I just think programming languages are fun. I don’t really have any picks in particular other than, if you ever think programming gets boring, go write a language. It won’t be boring anymore. And you’ll learn things about the language that you didn’t ever realize, things you just took for granted. So, I guess that’s my pick, go write a language to learn one. CHUCK: **Awesome. I think I’m going to go write RubyScript and Coffee and then, I can get the same dumb questions that I get about JavaScript. [Laughter]**CHUCK:  So JavaScript, is that Java? No. CoffeeScript, is that coffee? No. Anyway, who hasn’t given the picks yet? I guess I’ll go and then we’ll have our guests go.
Jamison, did you give us picks? JAMISON:  I did not but you can forget me. CHUCK:  We’ll let Jamison go then. JAMISON:  I’ve got three. My first one is a dumb blog called Thumbs and Ammo. It’s just a little Tumblr picture blog. And it’s pictures of action movie stars where they take their guns and Photoshop thumbs in their place. So, it looks like they’re giving people thumbs up. It’s pretty awesome. It made me giggle. My other one is totally a guilty pleasure. I make fun of Dubstep a lot. But then sometimes, I fall into it anyway. So, it’s this Dubstep album called ISM by Savant, I think, is the artist. It’s kind of like chiptune-ish dubstep, but it’s totally like awful music. It’s still good to listen to though, I enjoyed it. And then my last one, I haven’t been there in a while and then I came back again and it was so good. I went to Vimcasts.org again and I forgot how amazing his accent is. So, if you use Vim and you want to learn more about it and you also want to just be entertained by listening to someone speak English in a great accent, go to Vimcasts.org. Those are my picks. CHUCK:  Nice. Alright. So, I’m going to pick a couple of things. One of them is a little bit self-serving and that is, well, it’s self-serving in the sense that it’s something that I’m doing. But it’s something that obviously I’m giving back to the community. We are starting the iPhreaks Show next week. And that will be on iOS programming. So, if you have an interest in that, then go to iPhreaksShow.com. The site should be up by the time this gets released. It’s not there now. But anyway, go check it out. We’ve got some awesome panelists and it’s going to be a lot of fun. My other pick is Mozy. And Mozy is actually a company that I used to work for; it’s here in Utah. But it’s an online backup company.
The thing that I really like about it is that I know the algorithms that they use to store everything because I did tech support for them, and I’m really comfortable with the fact that my data isn’t going to vanish from their servers. But it’s also not here as a local backup. So, I can use Time Machine on my Mac and get local backups but, heaven forbid, if somebody broke into my house and stole my machine and the hard drive with it, or the house burned down or something, I wouldn’t be totally out of luck. All the stuff that I’ve worked hard on is still out there and I can just restore it to a new machine. I typically think that it’s a good idea to have a local backup and a remote backup and then you’re just covered for all the ‘just in cases’. David, what are your picks? DAVID:  Alright. My first pick is the micro USB cable that I bought from Walgreens because I lost my last cable. The only one they had was bright pink. I needed one so I just bought it. And I had no idea what a benefit it would be to have a bright pink micro USB cable because wherever I go, I never forget it. It’s always the most noticeable thing in the room. So, I never walk out without it. ARIYA: **Nobody’s going to steal it as well. [Laughter] CHUCK:  I don’t know. I have some kids that would like it. DAVID: [Laughs] My other pick is something that I’m super excited about. We’ve been working on this in my group at Mozilla Research along with the Mozilla JavaScript engine team. And we just put out our first release of the subset of JavaScript called asm.js. And what asm.js is, is basically we’ve just said, “Look, here’s a subset of JavaScript. It’s totally just JavaScript. It works in every JavaScript engine but if you can generate code in that subset, we can optimize it like nobody’s business.” And so, what we did was we worked with the game company Epic, who make the Unreal Engine, and we helped them port their Unreal Engine 3 to the web.
And we’re talking about over a million lines of C++ code being ported to JavaScript through the Emscripten compiler and generating asm.js on the backend. And we showed that in the latest nightly builds of Firefox, it gets performance that feels exactly like native performance. There’s a little YouTube video of it online where you can see some of the flythroughs. We haven’t actually released any of those demos. I think, at least, one of them is going to be released publicly where people can play with it.**JAMISON: **I was going to ask, when can I play Unreal Tournament in my browser? [Laughter]**JAMISON:   I saw the GDC videos but… DAVID:  Yeah. I think they’re just going to be demos. I don’t think there’s actually going to be a full Unreal game that’s released, but I’m not sure about that. Anyway, it’s a technology demonstration. We’re just trying to show that the web is actually ready, with no plug-ins, to run games the same way native can. And with asm.js, we’re closing the gap to native performance to the point where our benchmarks are showing we’re within 50% of native speed at this point. So, I’m very excited about what we’re showing the web can do. CHUCK:  Interesting. Ariya, what are your picks? ARIYA: **My first is related to that Coursera course about human behavior. It’s a book called ‘Beyond Office Politics: The Hidden Story of Power, Affiliation & Achievement in the Workplace’ by Linda Sommer. It tells you why a person behaves in certain ways and how you can work together with that person once you know their motivations and objectives. Really fantastic book, particularly for engineers, to understand the social interaction between people. The second one is a library that I’m having fun with; it’s a library and a tool. It’s called Istanbul, from Yahoo. It’s a JavaScript code coverage tool and it tells you exactly what part of the code you didn’t test. That includes all the branches and segments and functions and so on.
Famous quote from [inaudible] from Yahoo, “If you think [inaudible] hurts your feelings, wait until you use Istanbul.” [Laughter]**CHUCK:  Alright, sounds great. We’ll go ahead and wrap up the show. Thanks for coming guys. It’s been a real interesting talk. JAMISON:  Thank you so much. DAVID:   My pleasure. ARIYA:  Thank you. CHUCK: ** We’ll be on next week. Well, you guys will be on next week. I’m going to be gone next week but we’ll see you all in a week or two.
