RubyConf 2015

What regular expressions can teach us about DSL design

Betsy Haibel · DSL

Transcript

Excerpt from the automatic transcription of the video generated by YouTube.

- My name is Betsy Haibel and this afternoon we're going to be speaking about regexes, and specifically their DSL design and what we can learn from it when we're designing other DSLs. So just to keep everyone on the same page, we're going to start with a quick introduction to regular expressions for anyone who's not familiar with them, or for anyone in the audience who could use a refresher since they haven't worked with them in a while.

Here's the simplest regex I can think of. It searches a given text for the letters d, o, and g, in that order and with no characters between them. So it'll match any of these strings here. And here's a less trivial example. In this one, we use the period wildcard to match any character.

Since this wildcard matches any character, the regular expression d.g, which is now on the screen, thank you Google, can match the strings dig, d space g, d exclamation point g, or a lot of other things. There are a lot of other little wildcards. They can match more specific things as well.
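The two patterns described so far can be tried out in Ruby. This is only a minimal sketch: the slides are not part of the transcript, so the sample strings and the helper name `strings_matching` are assumptions made for illustration.

```ruby
# Literal pattern: the letters d, o, g in that order, with nothing between them.
LITERAL_DOG = /dog/

# Wildcard pattern: d, then any single character (other than a newline), then g.
WILDCARD_DOG = /d.g/

# Hypothetical helper: keep only the candidates that contain a match.
def strings_matching(pattern, candidates)
  candidates.select { |text| text.match?(pattern) }
end

# "d og" is rejected by the literal pattern because a space sits between d and o.
p strings_matching(LITERAL_DOG, ["dog", "hotdog stand", "d og"])

# "dg" is rejected by the wildcard pattern because the period must consume one character.
p strings_matching(WILDCARD_DOG, ["dog", "dig", "d g", "d!g", "dg"])
```

Note that neither pattern is anchored, so a match anywhere inside the string counts.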

Word characters, whitespace. God damn it. Even a thing called a word boundary, which is the first or last character of any given word. Both characters and wildcards can be grouped, if the default groupings aren't powerful enough, and you can specify the number of characters to be matched with other wildcards, like star and plus.
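As a rough illustration of these building blocks in Ruby syntax (the constant names, the helper `first_match`, and the sample strings are all assumptions, not taken from the talk's slides):

```ruby
WORD_RUN    = /\w+/      # one or more word characters (letters, digits, underscore)
WHITESPACE  = /\s/       # a single whitespace character
WHOLE_WORD  = /\bdog\b/  # "dog" with a word boundary on each side
REPEATED_HA = /(?:ha)+/  # the group "ha", repeated one or more times (plus)
ANY_O_COUNT = /do*g/     # d, then zero or more o's (star), then g

# Hypothetical helper: the first substring matched by the pattern, or nil.
def first_match(pattern, text)
  text[pattern]
end

p first_match(WORD_RUN, "good dog!")        # first run of word characters
p first_match(REPEATED_HA, "hahaha!")       # the whole repeated group
p "a b".match?(WHITESPACE)                  # true: contains a space
p "hotdog".match?(WHOLE_WORD)               # false: no boundary before "dog"
p "hot dog".match?(WHOLE_WORD)              # true: "dog" stands alone
p %w[dg dog doooog dig].grep(ANY_O_COUNT)   # "dig" is left out
```

In Ruby's engine, `\b` matches the position at the edge of a word rather than consuming a character, which is why `WHOLE_WORD` still matches exactly three letters.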

The specifics matter less right now than the mere fact that there are a lot of things you can do. Getting a little more complex, you can use capture groups to single out specific subsets of your match for special treatment, and a backreference to refer to a previously captured group later on within the same regular expression.
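A small Ruby sketch of a capture group combined with a backreference follows. The pattern, the name `DOUBLED_WORD`, the helper `doubled_word_in`, and the sample sentences are assumptions chosen for illustration; the example on the original slide is not in the transcript.

```ruby
# (\w+) captures a word; \1 then requires that exact captured text to appear
# again, separated from the first occurrence only by whitespace.
DOUBLED_WORD = /\b(\w+)\s+\1\b/

# Hypothetical helper: the word that is accidentally repeated, or nil if none.
def doubled_word_in(text)
  match = DOUBLED_WORD.match(text)
  match && match[1] # match[0] is the whole match, match[1] the first capture group
end

p doubled_word_in("a previously captured captured group")
p doubled_word_in("Peter Piper picked a peck of pickled peppers")
```

The first call returns the repeated word; the second returns `nil`, because no word in that sentence is immediately followed by an identical copy of itself.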

Also, Peter Piper picked a peck of pickled peppers. So we've got all these building blocks, and individually they're pretty simple. Come on, good little computer. There we go. And I'm not going to pretend that all regex are simple.

This, for example, is an email validation regex that someone, somewhere, for some reason, recommended that other programmers use in production. The simple elements that make up regexes can be combined in ghastly, hieroglyphic-esque ways, and often are. So at this point, you may be wondering some things.

Things like whether it is possible to learn about designing DSLs, or indeed about designing anything, from something that produces screenfuls of mess and that can't even fully parse an email address in the process, because, of course, that email validation regex I just showed you did not actually work.

And the answer is that regex are old. Like C, like shell scripts, like Vim, regex are gawky and horrible, and everyone has used them for decades anyway. They are too bloody useful to erase. They are too bloody useful to give up, no matter how much we try to replace them with tools that are aesthetically prettier.

Anything that bloody useful has to teach us design lessons, whether its surface seems polished or not. The biggest goal of software design, over and above how elegant things are, is getting the damn thing to work. And regex, bless them, do that, if nothing else.

So that, as we will see later in this talk, is because they get to cheat, but we can still learn from the ways they cheat. So how old are regex anyway? They were first defined as a mathematical concept back in 1958. They were an outgrowth of set theory, used for describing the grammars of regular languages.

A decade later, they were implemented as a simple, independent programming language. Note that this first implementation treated them as a programming language in their own right. A few years after that, they began to see wider use when they were embedded into a concrete tool, the Unix utility grep.

They then became embedded in more and more powerful tools such as sed and awk, and were embedded into the programming language Perl in 1987 as a first-class language concept. In other words, regular expressions got a lot more powerful and useful, and therefore a lot more used, when they became a domain-specific language for string processing, embedded within a more general-purpose language.

In the 28 years since Perl came on the scene, regex implementations have been baked into countless other programming languages. We're at the point where they're considered a language feature rather than a language in their own right.

Most programmers have forgotten that or never knew. And when I frame regex historically like that, contrasting their early days as a programming language in their own right with their modern days as an embedded DSL, it naturally leads to a question. What are DSLs anyway? Are they appreciably different from programming languages? While I don't necessarily think that the c2 wiki is an authoritative source, it's someplace where a lot of smart people have had a number of informed opinions over the years.

And they define DSLs, in a consensus that was reached through a sheer, stunning amount of debate, as programming languages designed specifically to express solutions to problems in a specific domain.

There are a lot of spirited discussions about the merits of this pattern, because two programmers and three opinions and c2 wiki, but it's universally agreed by all of these programmers with all of these opinions that both the potential beauty and the potential horror of DSLs stem from their place as languages in their own right, because languages are difficult to design.

They also do some talking about whether regex are actually a DSL, fascinatingly enough. A lot of people don't think they're complex enough to count as a language. To each their own, but I'm the kind of person who will die on the hill that CSS and SQL are also programming languages.

And regex have far more complex control structures, even if these control structures are not actually powerful enough to avoid this kind of email validation regex, and to let you express those ideas in a more concise fashion. But that cautionary tale aside, which is absolutely what we think of when we think of regex in fear, in the wild, most production regex are a lot closer to this basic example, and while d.

[ ... ]

Note: the other 2,959 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.