http://blogs.clariusconsulting.net/kzu

Daniel Cazzulino's Blog

Making regex authoring easier to read and maintain

I’m spiking ideas on how to make my regular expressions easier to read and maintain for the dev who comes after me (that could be myself in 3 months, meaning I will surely have forgotten everything about how that crazy regex worked).

I’m aware and wary of fluent API alternatives to building regular expressions which IMO hinder readability more than anything. They are just too verbose.

So here’s a progression of options that I’m thinking of. I’d like to get your feedback on what makes for a readable pattern and how much you care about extension method pollution (since patterns are all strings, extension methods would need to “hang” there):

1:

// "old"-style, may need to duplicate patterns on the full expression
private const string StringValue = "(\"[^\"]+\"|[\\S]+)";

2:

// extracting component patterns to make it explicit what they match
private const string NonWhitespace = "[\\S]+";
private const string QuotedValue = "\"[^\"]+\"";

private const string StringValue = "(" + QuotedValue + "|" + NonWhitespace + ")";

3:

private const string NonWhitespace = "[\\S]+";
private const string QuotedValue = "\"[^\"]+\"";

// using extension methods on string to compose instead of string concat. pollutes string.[...].
// can't be a const anymore. methods all return further strings
private static readonly string StringValue = QuotedValue.Alternate(NonWhitespace).Group();
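
A minimal sketch of what the two extension methods in option 3 might look like. Only the names Alternate and Group come from the snippet above; the method bodies and the class names RegexStringExtensions and Option3Patterns are assumptions made for illustration:

```csharp
// Hypothetical helpers: only the method names Alternate and Group come from the post.
public static class RegexStringExtensions
{
    // Joins two patterns with the alternation operator: pattern|other.
    public static string Alternate(this string pattern, string other)
    {
        return pattern + "|" + other;
    }

    // Wraps the pattern in a capturing group: (pattern).
    public static string Group(this string pattern)
    {
        return "(" + pattern + ")";
    }
}

public static class Option3Patterns
{
    public const string NonWhitespace = "[\\S]+";
    public const string QuotedValue = "\"[^\"]+\"";

    // Composes to the same pattern text as the concatenation in option 2.
    public static readonly string StringValue = QuotedValue.Alternate(NonWhitespace).Group();
}
```

Both methods show up on every string in scope, which is the pollution concern raised above.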

4:

private const string NonWhitespace = "[\\S]+";
private const string QuotedValue = "\"[^\"]+\"";

// using an extension method entry point avoids polluting string.[...]
// (note the implicit cast to string from the fluent API "holder": the methods are no longer extension
// methods, they live in a RegexBuilder fluent class)
private static readonly string StringValue = QuotedValue.Regex().Alternate(NonWhitespace).Group();

5:

private const string NonWhitespace = "[\\S]+";
private const string QuotedValue = "\"[^\"]+\"";

// (ab?)using operator overloading when possible. opinionated concat here, as we assume alternation is
// wrapped in a group always (most common scenario, explicit group captures can be used anyway).
private static readonly string StringValue = QuotedValue.Regex() | NonWhitespace;

 

For #5, are there other operator overloads that might be useful? “+” “&” etc? Not sure which ones map intuitively to regex, except for the obvious alternation one…
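
To make options 4 and 5 concrete, here is a rough sketch of how the fluent holder could be put together. The names RegexBuilder, Regex(), Alternate() and Group() come from the snippets above; everything else (the member bodies, RegexBuilderEntryPoint, BuilderPatterns) is an assumption for illustration, not a finished design:

```csharp
using System;

// Hypothetical fluent holder; the member bodies are assumptions.
public sealed class RegexBuilder
{
    private readonly string pattern;

    public RegexBuilder(string pattern)
    {
        this.pattern = pattern ?? throw new ArgumentNullException(nameof(pattern));
    }

    // Option 4: ordinary instance methods instead of string extension methods.
    public RegexBuilder Alternate(string other) => new RegexBuilder(pattern + "|" + other);

    public RegexBuilder Group() => new RegexBuilder("(" + pattern + ")");

    // Option 5: alternation through the | operator, always wrapped in a group.
    public static RegexBuilder operator |(RegexBuilder left, string right)
        => left.Alternate(right).Group();

    // Implicit cast so a builder expression can be assigned to a string field.
    public static implicit operator string(RegexBuilder builder) => builder.pattern;

    public override string ToString() => pattern;
}

public static class RegexBuilderEntryPoint
{
    // The only extension method that "hangs" off string.
    public static RegexBuilder Regex(this string pattern) => new RegexBuilder(pattern);
}

public static class BuilderPatterns
{
    public const string NonWhitespace = "[\\S]+";
    public const string QuotedValue = "\"[^\"]+\"";

    public static readonly string Option4 = QuotedValue.Regex().Alternate(NonWhitespace).Group();
    public static readonly string Option5 = QuotedValue.Regex() | NonWhitespace;
}
```

In this sketch both fields end up holding the same pattern text; the operator only hides the explicit Group() call.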

This will almost surely become a new NETFx so it’s easy to bring in to any project. Let me know what you think!

14 Comments

  1. To be honest, I think regular expressions are something devs should learn to read and write. The syntax may be wonky, but it’s not as difficult as it looks once you start using them. I certainly don’t think they warrant their own custom API.

    The hard part about regex is understanding the constituent parts – grouping, alternation, lookarounds, etc. It doesn’t look like the API would alleviate anyone from having to understand these semantics. Moreover, it seems to obfuscate them.

    Bottom line, I know what to expect from this:
    “(?\”[^\"]+\”|[\\S]+)

    I have no idea what this does:
    SomeString.Regex() | SomeOtherString

    Rather than a new API, I’d rather see a stable, tested library of common .NET regex’s (like the ones available on CPAN or PyPI).

    Just some thoughts….

  2. I am a beginner regex dev. I separate my rules across multiple lines, split them up, and run each regex separately. At the end of each line I place a regex comment.
    (^/)|(^\w)#Must start with slash or word
    ^((?!\.).)*$#Cannot contain dot
    ^((?!\?).)*$#Cannot contain ?
    ^((?!//).)*$#Cannot contain //
    ^(.?$|[^/].+|m[^\w].*)#Cannot start with a slash and word
    (?
    \\. # Capture an escaped character
    | # OR
    \[\^? # a character class
    (?:\\.|[^\]])* # which may also contain escaped characters
    \]
    | # OR
    \(\?(?# inline comment!)\#
    (?[^)]*)
    \)
    | # OR
    \#(?.*$) # a common comment!
    | # OR
    [^\[\\#] # capture any regular character – not # or [
    )*
    \z
    ";

    Match parsed = Regex.Match(regexPattern, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

    string regexComments = "";
    if (parsed.Success)
    {
    foreach (Capture capture in parsed.Groups["Comment"].Captures)
    {
    if (regexComments != "") regexComments += "\r\n";
    regexComments += capture.Value;
    }
    }

    return regexComments;
    }

  3. Sorry, the above function has many lines stripped out. Damn you comment boxes which do this! See http://stackoverflow.com/questions/5073826/how-to-extract-regex-comment for

  4. [...] Making regex authoring easier to read and maintain – Daniel Cazzulino takes a look at 5 different ways of representing a regular expression in your code, each offering a different level of readability, maintainability, and ease of understanding what the regular expression is actually going to do. [...]

  5. I’m with beefarino on this. Option 1 is my favourite. If you’re struggling to interpret a large regex, I’d recommend pasting the string into a proper regex tool like Expresso, which is both free and brilliant.

    If I came across options 2, 3, 4 or 5 and I *still* didn’t understand them, then it would be a lot more work to produce a string that I could paste into Expresso.

  6. Using Regex.Escape may improve readability when you have to escape a lot of characters.
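
    As a small illustration of that suggestion, the sketch below compares a hand-escaped pattern with one produced by .NET's Regex.Escape. The class name EscapeExample and the sample literal are made up for the example:

```csharp
using System.Text.RegularExpressions;

public static class EscapeExample
{
    // Literal text that contains several regex metacharacters.
    public const string Literal = "$9.99 (approx.)";

    // Escaped by hand: each metacharacter needs its own (doubled) backslash.
    public const string HandEscaped = "\\$9\\.99 \\(approx\\.\\)";

    // Let the framework escape the metacharacters instead.
    public static readonly string Escaped = Regex.Escape(Literal);

    // True when the pattern matches the whole literal and nothing else.
    public static bool MatchesLiteral(string pattern)
    {
        return Regex.IsMatch(Literal, "^" + pattern + "$");
    }
}
```

    Both patterns match the literal text; the second keeps the source readable because the literal appears unmodified.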

  7. Here’s what I do:

    1. Place the text of all significant Regexs in a separate text file that I include in the project as an embedded resource that looks similar to an INI file. I have a library routine that extracts the text and builds a regex out of it.

    2. Use the “Ignore Whitespace” option so I can spread the Regex out over several lines, add comments, etc. Just like regular code, spacing, indenting and comments can vastly improve usability. I can look at pretty hairy Regexs I wrote a year ago with little if any confusion.

    3. Use Expresso or RegexBuddy to build and debug individual Regexs or even fragments of a Regex (do this even if you ignore everything else).

    4. Use the Explicit Capture option and Named Subexpressions to make it crystal clear what different Subexpressions do. I find the (?'name' syntax more readable than the traditional (?<name> syntax.

    5. Pick a style for the names in #4 and stick with it. I do lowercase, but you could argue for PascalCase.

    6. (optional) Make up your own extensions to the Regex syntax and do simple string replacement to expand them. For example, I change \:V to a variable name in most languages – [a-zA-Z][a-zA-Z0-9_]. I have similar shortcuts for anything I use often. I have a boilerplate comment section I put at the top of the embedded resource file that explains how it all works.

    7. For simple one-liners where you want something lighter in a string, use the @"" form of the string if you’re in C# to eliminate double escapes. Also use the “Ignore Whitespace” option.

    8. Always set ALL options using an options switch at the start of the Regex – (?ix
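
    Several of the points above (verbatim strings, the Ignore Whitespace and Explicit Capture options set inline, and a (?'name' group) can be combined in one pattern. The sketch below applies them to the quoted-or-non-whitespace value from the post; the class name CommentedPattern, the helper FirstValue and the group name "value" are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class CommentedPattern
{
    // Verbatim string: backslashes are not doubled, and "" stands for one quote.
    // (?nx) turns on explicit capture (n) and ignore-pattern-whitespace (x) inline.
    public const string StringValue = @"(?nx)
        (?'value'           # named group holding the matched value
            ""[^""]+""      #   a double-quoted run of characters
            |               #   or
            \S+             #   a run of non-whitespace characters
        )";

    // Returns the first value found in the input, or null when there is none.
    public static string FirstValue(string input)
    {
        Match match = Regex.Match(input, StringValue);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

    Because the options are part of the pattern text, the same string behaves identically when pasted into a regex tool.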

  8. Oops… Sent too soon.

    …clunky overloads.

    As you can see, it’s not Regexs that people struggle with, it’s the tooling. We’re all conditioned to think of a Regex as something in a string whether it’s delimited by “” or // with an immutable syntax, and that is what makes them so hard to work with.

  9. I have to say I think option 2 reads the best out of all those. It is clear and concise and it’s obvious what it’s supposed to do because the variable declarations are right there.

    I agree with beefarino except that it takes more cognitive effort for me to parse a whole regular expression in one go than to parse each of the variable definitions.

  10. I really like step 2, but I think in your zeal to make regexs readable your 5th step is just as confusing, if not more so, than the original regex syntax.
