SMASHINGMAGAZINE.COM
Regexes Got Good: The History And Future Of Regular Expressions In JavaScript
Modern JavaScript regular expressions have come a long way compared to what you might be familiar with. Regexes can be an amazing tool for searching and replacing text, but they have a longstanding reputation (perhaps outdated, as Ill show) for being difficult to write and understand.This is especially true in JavaScript-land, where regexes languished for many years, comparatively underpowered compared to their more modern counterparts in PCRE, Perl, .NET, Java, Ruby, C++, and Python. Those days are over.In this article, Ill recount the history of improvements to JavaScript regexes (spoiler: ES2018 and ES2024 changed the game), show examples of modern regex features in action, introduce you to a lightweight JavaScript library that makes JavaScript stand alongside or surpass other modern regex flavors, and end with a preview of active proposals that will continue to improve regexes in future versions of JavaScript (with some of them already working in your browser today).The History of Regular Expressions in JavaScriptECMAScript 3, standardized in 1999, introduced Perl-inspired regular expressions to the JavaScript language. Although it got enough things right to make regexes pretty useful (and mostly compatible with other Perl-inspired flavors), there were some big omissions, even then. And while JavaScript waited 10 years for its next standardized version with ES5, other programming languages and regex implementations added useful new features that made their regexes more powerful and readable.But that was then.Did you know that nearly every new version of JavaScript has made at least minor improvements to regular expressions?Lets take a look at them.Dont worry if its hard to understand what some of the following features mean well look more closely at several of the key features afterward.ES5 (2009) fixed unintuitive behavior by creating a new object every time regex literals are evaluated and allowed regex literals to use unescaped forward slashes within character classes (/[/]/).ES6/ES2015 added two new regex flags: y (sticky), which made it easier to use regexes in parsers, and u (unicode), which added several significant Unicode-related improvements along with strict errors. It also added the RegExp.prototype.flags getter, support for subclassing RegExp, and the ability to copy a regex while changing its flags.ES2018 was the edition that finally made JavaScript regexes pretty good. It added the s (dotAll) flag, lookbehind, named capture, and Unicode properties (via \p{...} and \P{...}, which require ES6s flag u). All of these are extremely useful features, as well see.ES2020 added the string method matchAll, which well also see more of shortly.ES2022 added flag d (hasIndices), which provides start and end indices for matched substrings.And finally, ES2024 added flag v (unicodeSets) as an upgrade to ES6s flag u. The v flag adds a set of multicharacter properties of strings to \p{...}, multicharacter elements within character classes via \p{...} and \q{...}, nested character classes, set subtraction [A--B] and intersection [A&&B], and different escaping rules within character classes. It also fixed case-insensitive matching for Unicode properties within negated sets [^...].As for whether you can safely use these features in your code today, the answer is yes! The latest of these features, flag v, is supported in Node.js 20 and 2023-era browsers. The rest are supported in 2021-era browsers or earlier.Each edition from ES2019 to ES2023 also added additional Unicode properties that can be used via \p{...} and \P{...}. And to be a completionist, ES2021 added string method replaceAll although, when given a regex, the only difference from ES3s replace is that it throws if not using flag g.Aside: What Makes a Regex Flavor Good?With all of these changes, how do JavaScript regular expressions now stack up against other flavors? There are multiple ways to think about this, but here are a few key aspects:Performance.This is an important aspect but probably not the main one since mature regex implementations are generally pretty fast. JavaScript is strong on regex performance (at least considering V8s Irregexp engine, used by Node.js, Chromium-based browsers, and even Firefox; and JavaScriptCore, used by Safari), but it uses a backtracking engine that is missing any syntax for backtracking control a major limitation that makes ReDoS vulnerability more common.Support for advanced features that handle common or important use cases.Here, JavaScript stepped up its game with ES2018 and ES2024. JavaScript is now best in class for some features like lookbehind (with its infinite-length support) and Unicode properties (with multicharacter properties of strings, set subtraction and intersection, and script extensions). These features are either not supported or not as robust in many other flavors.Ability to write readable and maintainable patterns.Here, native JavaScript has long been the worst of the major flavors since it lacks the x (extended) flag that allows insignificant whitespace and comments. Additionally, it lacks regex subroutines and subroutine definition groups (from PCRE and Perl), a powerful set of features that enable writing grammatical regexes that build up complex patterns via composition.So, its a bit of a mixed bag.JavaScript regexes have become exceptionally powerful, but theyre still missing key features that could make regexes safer, more readable, and more maintainable (all of which hold some people back from using this power).The good news is that all of these holes can be filled by a JavaScript library, which well see later in this article.Using JavaScripts Modern Regex FeaturesLets look at a few of the more useful modern regex features that you might be less familiar with. You should know in advance that this is a moderately advanced guide. If youre relatively new to regex, here are some excellent tutorials you might want to start with:RegexLearn and RegexOne are interactive tutorials that include practice problems.JavaScript.infos regular expressions chapter is a detailed and JavaScript-specific guide.Demystifying Regular Expressions (video) is an excellent presentation for beginners by Lea Verou at HolyJS 2017.Learn Regular Expressions In 20 Minutes (video) is a live syntax walkthrough in a regex tester.Named CaptureOften, you want to do more than just check whether a regex matches you want to extract substrings from the match and do something with them in your code. Named capturing groups allow you to do this in a way that makes your regexes and code more readable and self-documenting.The following example matches a record with two date fields and captures the values:const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';const re = /^Admitted: (?<admitted>\d{4}-\d{2}-\d{2})\nReleased: (?<released>\d{4}-\d{2}-\d{2})$/;const match = record.match(re);console.log(match.groups);/* { admitted: '2024-01-01', released: '2024-01-03'} */Dont worry although this regex might be challenging to understand, later, well look at a way to make it much more readable. The key things here are that named capturing groups use the syntax (?<name>...), and their results are stored on the groups object of matches.You can also use named backreferences to rematch whatever a named capturing group matched via \k<name>, and you can use the values within search and replace as follows:// Change 'FirstName LastName' to 'LastName, FirstName'const name = 'Shaquille Oatmeal';name.replace(/(?<first>\w+) (?<last>\w+)/, '$<last>, $<first>');// 'Oatmeal, Shaquille'For advanced regexers who want to use named backreferences within a replacement callback function, the groups object is provided as the last argument. Heres a fancy example:function fahrenheitToCelsius(str) { const re = /(?<degrees>-?\d+(\.\d+)?)F\b/g; return str.replace(re, (...args) => { const groups = args.at(-1); return Math.round((groups.degrees - 32) * 5/9) + 'C'; });}fahrenheitToCelsius('98.6F');// '37C'fahrenheitToCelsius('May 9 high is 40F and low is 21F');// 'May 9 high is 4C and low is -6C'LookbehindLookbehind (introduced in ES2018) is the complement to lookahead, which has always been supported by JavaScript regexes. Lookahead and lookbehind are assertions (similar to ^ for the start of a string or \b for word boundaries) that dont consume any characters as part of the match. Lookbehinds succeed or fail based on whether their subpattern can be found immediately before the current match position.For example, the following regex uses a lookbehind (?<=...) to match the word cat (only the word cat) if its preceded by fat :const re = /(?<=fat )cat/g;'cat, fat cat, brat cat'.replace(re, 'pigeon');// 'cat, fat pigeon, brat cat'You can also use negative lookbehind written as (?<!...) to invert the assertion. That would make the regex match any instance of cat thats not preceded by fat .const re = /(?<!fat )cat/g;'cat, fat cat, brat cat'.replace(re, 'pigeon');// 'pigeon, fat cat, brat pigeon'JavaScripts implementation of lookbehind is one of the very best (matched only by .NET). Whereas other regex flavors have inconsistent and complex rules for when and whether they allow variable-length patterns inside lookbehind, JavaScript allows you to look behind for any subpattern.The matchAll MethodJavaScripts String.prototype.matchAll was added in ES2020 and makes it easier to operate on regex matches in a loop when you need extended match details. Although other solutions were possible before, matchAll is often easier, and it avoids gotchas, such as the need to guard against infinite loops when looping over the results of regexes that might return zero-length matches.Since matchAll returns an iterator (rather than an array), its easy to use it in a for...of loop.const re = /(?<char1>\w)(?<char2>\w)/g;for (const match of str.matchAll(re)) { const {char1, char2} = match.groups; // Print each complete match and matched subpatterns console.log(Matched "${match[0]}" with "${char1}" and "${char2}");}Note: matchAll requires its regexes to use flag g (global). Also, as with other iterators, you can get all of its results as an array using Array.from or array spreading.const matches = [...str.matchAll(/./g)];Unicode PropertiesUnicode properties (added in ES2018) give you powerful control over multilingual text, using the syntax \p{...} and its negated version \P{...}. There are hundreds of different properties you can match, which cover a wide variety of Unicode categories, scripts, script extensions, and binary properties.Note: For more details, check out the documentation on MDN.Unicode properties require using the flag u (unicode) or v (unicodeSets).Flag vFlag v (unicodeSets) was added in ES2024 and is an upgrade to flag u you cant use both at the same time. Its a best practice to always use one of these flags to avoid silently introducing bugs via the default Unicode-unaware mode. The decision on which to use is fairly straightforward. If youre okay with only supporting environments with flag v (Node.js 20 and 2023-era browsers), then use flag v; otherwise, use flag u.Flag v adds support for several new regex features, with the coolest probably being set subtraction and intersection. This allows using A--B (within character classes) to match strings in A but not in B or using A&&B to match strings in both A and B. For example:// Matches all Greek symbols except the letter ''/[\p{Script_Extensions=Greek}--]/v// Matches only Greek letters/[\p{Script_Extensions=Greek}&&\p{Letter}]/vFor more details about flag v, including its other new features, check out this explainer from the Google Chrome team.A Word on Matching EmojiEmoji are , but how emoji get encoded in text is complicated. If youre trying to match them with a regex, its important to be aware that a single emoji can be composed of one or many individual Unicode code points. Many people (and libraries!) who roll their own emoji regexes miss this point (or implement it poorly) and end up with bugs.The following details for the emoji (Woman Teacher: Light Skin Tone) show just how complicated emoji can be:// Code unit length''.length;// 7// Each astral code point (above \uFFFF) is divided into high and low surrogates// Code point length[...''].length;// 4// These four code points are: \u{1F469} \u{1F3FB} \u{200D} \u{1F3EB}// \u{1F469} combined with \u{1F3FB} is ''// \u{200D} is a Zero-Width Joiner// \u{1F3EB} is ''// Grapheme cluster length (user-perceived characters)[...new Intl.Segmenter().segment('')].length;// 1Fortunately, JavaScript added an easy way to match any individual, complete emoji via \p{RGI_Emoji}. Since this is a fancy property of strings that can match more than one code point at a time, it requires ES2024s flag v.If you want to match emojis in environments without v support, check out the excellent libraries emoji-regex and emoji-regex-xs.Making Your Regexes More Readable, Maintainable, and ResilientDespite the improvements to regex features over the years, native JavaScript regexes of sufficient complexity can still be outrageously hard to read and maintain.Regular Expressions are SO EASY!!!! pic.twitter.com/q4GSpbJRbZ Garabato Kid (@garabatokid) July 5, 2019 ES2018s named capture was a great addition that made regexes more self-documenting, and ES6s String.raw tag allows you to avoid escaping all your backslashes when using the RegExp constructor. But for the most part, thats it in terms of readability.However, theres a lightweight and high-performance JavaScript library named regex (by yours truly) that makes regexes dramatically more readable. It does this by adding key missing features from Perl-Compatible Regular Expressions (PCRE) and outputting native JavaScript regexes. You can also use it as a Babel plugin, which means that regex calls are transpiled at build time, so you get a better developer experience without users paying any runtime cost.PCRE is a popular C library used by PHP for its regex support, and its available in countless other programming languages and tools.Lets briefly look at some of the ways the regex library, which provides a template tag named regex, can help you write complex regexes that are actually understandable and maintainable by mortals. Note that all of the new syntax described below works identically in PCRE.Insignificant Whitespace and CommentsBy default, regex allows you to freely add whitespace and line comments (starting with #) to your regexes for readability.import {regex} from 'regex';const date = regex` # Match a date in YYYY-MM-DD format (?<year> \d{4}) - # Year part (?<month> \d{2}) - # Month part (?<day> \d{2}) # Day part`;This is equivalent to using PCREs xx flag.Subroutines and Subroutine Definition GroupsSubroutines are written as \g<name> (where name refers to a named group), and they treat the referenced group as an independent subpattern that they try to match at the current position. This enables subpattern composition and reuse, which improves readability and maintainability.For example, the following regex matches an IPv4 address such as 192.168.12.123:import {regex} from 'regex';const ipv4 = regex`\b (?<byte> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d) # Match the remaining 3 dot-separated bytes (\. \g<byte>){3}\b`;You can take this even further by defining subpatterns for use by reference only via subroutine definition groups. Heres an example that improves the regex for admittance records that we saw earlier in this article:const record = 'Admitted: 2024-01-01\nReleased: 2024-01-03';const re = regex` ^ Admitted:\ (?<admitted> \g<date>) \n Released:\ (?<released> \g<date>) $ (?(DEFINE) (?<date> \g<year>-\g<month>-\g<day>) (?<year> \d{4}) (?<month> \d{2}) (?<day> \d{2}) )`;const match = record.match(re);console.log(match.groups);/* { admitted: '2024-01-01', released: '2024-01-03'} */A Modern Regex Baselineregex includes the v flag by default, so you never forget to turn it on. And in environments without native v, it automatically switches to flag u while applying vs escaping rules, so your regexes are forward and backward-compatible.It also implicitly enables the emulated flags x (insignificant whitespace and comments) and n (named capture only mode) by default, so you dont have to continually opt into their superior modes. And since its a raw string template tag, you dont have to escape your backslashes \\\\ like with the RegExp constructor.Atomic Groups and Possessive Quantifiers Can Prevent Catastrophic BacktrackingAtomic groups and possessive quantifiers are another powerful set of features added by the regex library. Although theyre primarily about performance and resilience against catastrophic backtracking (also known as ReDoS or regular expression denial of service, a serious issue where certain regexes can take forever when searching particular, not-quite-matching strings), they can also help with readability by allowing you to write simpler patterns.Note: You can learn more in the regex documentation.Whats Next? Upcoming JavaScript Regex ImprovementsThere are a variety of active proposals for improving regexes in JavaScript. Below, well look at the three that are well on their way to being included in future editions of the language.Duplicate Named Capturing GroupsThis is a Stage 3 (nearly finalized) proposal. Even better is that, as of recently, it works in all major browsers.When named capturing was first introduced, it required that all (?<name>...) captures use unique names. However, there are cases when you have multiple alternate paths through a regex, and it would simplify your code to reuse the same group names in each alternative.For example:/(?<year>\d{4})-\d\d|\d\d-(?<year>\d{4})/This proposal enables exactly this, preventing a duplicate capture group name error with this example. Note that names must still be unique within each alternative path.Pattern Modifiers (aka Flag Groups)This is another Stage 3 proposal. Its already supported in Chrome/Edge 125 and Opera 111, and its coming soon for Firefox. No word yet on Safari.Pattern modifiers use (?ims:...), (?-ims:...), or (?im-s:...) to turn the flags i, m, and s on or off for only certain parts of a regex.For example:/hello-(?i:world)/// Matches 'hello-WORLD' but not 'HELLO-WORLD'Escape Regex Special Characters with RegExp.escapeThis proposal recently reached Stage 3 and has been a long time coming. It isnt yet supported in any major browsers. The proposal does what it says on the tin, providing the function RegExp.escape(str), which returns the string with all regex special characters escaped so you can match them literally.If you need this functionality today, the most widely-used package (with more than 500 million monthly npm downloads) is escape-string-regexp, an ultra-lightweight, single-purpose utility that does minimal escaping. Thats great for most cases, but if you need assurance that your escaped string can safely be used at any arbitrary position within a regex, escape-string-regexp recommends the regex library that weve already looked at in this article. The regex library uses interpolation to escape embedded strings in a context-aware way.ConclusionSo there you have it: the past, present, and future of JavaScript regular expressions.If you want to journey even deeper into the lands of regex, check out Awesome Regex for a list of the best regex testers, tutorials, libraries, and other resources. And for a fun regex crossword puzzle, try your hand at regexle.May your parsing be prosperous and your regexes be readable.
0 Commentarii 0 Distribuiri 258 Views