.NET 7 Performance: Regular Expressions – Part 2

There is this popular quote by Jamie Zawinski: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

In this second article of our short performance series, we want to look at the latter of those problems.

Sebastian is a consultant at Thinktecture and Microsoft MVP. He specialises in Generative AI in the business environment and ASP.NET Core.

It is true that regular expressions, or regex for short, can confuse developers. I’ve seen a lot of badly written, very confusing regexes, and many regular expressions are indeed hard to read and understand. And, yes, regexes are sometimes thrown at problems where other solutions would work just as well. That said, if you are doing text processing or need to search for and identify patterns in large text-based inputs, regular expressions are probably a well-suited and elegant tool for the task. So, when a regex is the correct tool for a specific job, we want to use that tool as well as we can.

This is the second article of the series about .NET 7 performance. 

  1. .NET 7 Performance: Introduction and Runtime Optimizations
  2. .NET 7 Performance: Regular Expressions
  3. .NET 7 Performance: Reflection (coming soon)

Regex in .NET

Using regular expressions in .NET is different from using them in languages such as Perl or JavaScript, which have dedicated regex syntax built into the language. In .NET, we need to use a class: we cannot embed the regex directly in our code as a literal like in JavaScript, and there are no language-level matching operators like in Perl.

First, we can use static methods on the Regex class: we pass the input and the expression as strings and get the results back. Alternatively, we can instantiate the Regex class, again with the expression provided as a string to the constructor, and then do everything with methods on that instance. The most important fact, however, is that regular expressions in .NET are interpreted by default, which is relatively slow. To counter that, we can tell .NET to compile a regex, which then executes a lot faster.

So, what do I mean by that? A regular expression is a pattern that can be recognized and executed by a finite state machine (FSM). The finite state machine that reads (or interprets) the regular expression pattern in .NET is a general-purpose state machine, capable of executing every regex that comes its way. This FSM is not optimized for any particular regular expression when it is initialized with that expression. It can do the work, but it’s slow. It is, however, also possible to “compile” the regex. The .NET regex system then algorithmically translates the specific pattern into IL (Intermediate Language) code for a distinct finite state machine that corresponds to exactly the given pattern. This FSM is specialized and optimized for this one pattern and can match it extremely fast, but the IL first needs to be generated and then JIT-compiled before it can be executed.

As we discussed in the introduction article of this performance series, performance is often a trade-off. In this case, the trade-off is between startup time (regex instantiation or compilation) and throughput (regex execution).

Regex optimizations pre .NET 7

Let’s assume we have this method:

				
    using System;
    using System.Text.RegularExpressions;

    public class MyValidator
    {
        public static bool IsValidCurrency(string value)
        {
            var pattern = @"\p{Sc}+\s*\d+";
            // A new Regex instance is constructed on every call.
            var currencyRegex = new Regex(pattern);
            return currencyRegex.IsMatch(value);
        }
    }

In this pattern, \p{Sc}+ matches one or more characters in the Unicode “Symbol, Currency” category, \s* matches zero or more whitespace characters, and \d+ matches one or more decimal digits. So, the pattern matches currency formats where the currency symbol precedes the number.
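
To make the pattern’s behaviour more concrete, here is a small sketch that runs it against a few sample inputs. The class name CurrencyPatternSamples, its methods and the inputs are made up for illustration and are not part of the sample repository. Note that the pattern is not anchored, so a currency amount anywhere inside a longer string is enough for a match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class CurrencyPatternSamples
{
    // Same pattern as in the article.
    private const string Pattern = @"\p{Sc}+\s*\d+";

    public static bool Matches(string input) => Regex.IsMatch(input, Pattern);

    public static void PrintSamples()
    {
        // Symbol in front of the number, with or without whitespace: match.
        Console.WriteLine(Matches("$100"));        // True
        Console.WriteLine(Matches("€ 50"));        // True

        // Number in front of the symbol, or no currency symbol at all: no match.
        Console.WriteLine(Matches("100 $"));       // False
        Console.WriteLine(Matches("USD 100"));     // False

        // The pattern is not anchored: a match inside a longer string counts.
        Console.WriteLine(Matches("Total: £20"));  // True
    }
}
```

If the whole input has to be a currency amount, anchoring the pattern with ^ and $ would be one option.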

This method is slow on every call: each time we create the currencyRegex instance, the regular expression is parsed and prepared for execution (in more technical terms, a new instance of the general-purpose FSM is initialized with the pattern). Versions of .NET before 7 already help us out a bit here with the static regex methods. We can use this approach instead:

				
    using System;
    using System.Text.RegularExpressions;

    public class MyValidator
    {
        public static bool IsValidCurrency(string value)
        {
            var pattern = @"\p{Sc}+\s*\d+";
            // The static method looks the pattern up in an internal cache.
            return Regex.IsMatch(value, pattern);
        }
    }

The static methods on the Regex class cache, by default, the last 15 evaluated patterns (or already initialized FSMs, if you will). If our application uses more patterns than that and they change often, we can also change the number of cached regular expressions. With this optimization, we trade some memory for otherwise recurring initialization costs.
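
One way to adjust the cache is the static Regex.CacheSize property. The following is a minimal sketch; the helper class RegexCacheSetup and the idea of calling it once at application startup are assumptions for illustration, and a suitable size depends on how many distinct patterns the application actually uses.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class RegexCacheSetup
{
    // Sets the number of patterns the static Regex methods keep cached
    // and returns the previously configured size.
    public static int SetCacheSize(int newSize)
    {
        int previous = Regex.CacheSize;

        // The setter throws an ArgumentOutOfRangeException for negative values.
        Regex.CacheSize = newSize;

        return previous;
    }
}
```

A call such as RegexCacheSetup.SetCacheSize(32) would let the static methods keep up to 32 patterns around. The cache only affects the static methods; Regex instances created with the constructor are not cached.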

Compiling a regex

Another optimization that .NET already provides is the compiled regex. Both approaches we have seen so far use interpreted regular expressions. Their startup is slow, yes, but it is still faster than the startup of what we are looking at now:

				
    using System;
    using System.Text.RegularExpressions;

    public class MyValidator
    {
        // Compiled once when the type is initialized, then reused for every call.
        private static readonly Regex _regex = new Regex(@"\p{Sc}+\s*\d+", RegexOptions.Compiled);

        public static bool IsValidCurrency(string value)
        {
            return _regex.IsMatch(value);
        }
    }

In this sample, we call the Regex constructor with RegexOptions.Compiled. As mentioned above, this translates the regex pattern into an optimized FSM and generates the IL code for it, which can then be executed much faster than the interpreted version. However, emitting the IL at runtime and then JIT-compiling this on-the-fly generated code for a single regex is quite some overhead and makes the startup much, much slower than initializing the general regex FSM. We also need to keep a reference to the compiled regex instance in memory to avoid compiling it again.
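
To get a rough feeling for this startup cost on your own machine, a quick Stopwatch-based sketch like the following can help. The class RegexStartupCost and its methods are made up for illustration. A single measurement like this is noisy and includes one-time effects such as loading the regex infrastructure on the first call, so treat the output as an indication only and not as a benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class RegexStartupCost
{
    private const string Pattern = @"\p{Sc}+\s*\d+";

    // Measures how long constructing a Regex with the given options takes.
    public static TimeSpan MeasureConstruction(RegexOptions options, out Regex regex)
    {
        var stopwatch = Stopwatch.StartNew();
        regex = new Regex(Pattern, options);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void PrintComparison()
    {
        // Warm-up so that the first measured construction does not pay
        // for loading the regex infrastructure itself.
        MeasureConstruction(RegexOptions.None, out _);

        TimeSpan interpreted = MeasureConstruction(RegexOptions.None, out _);
        TimeSpan compiled = MeasureConstruction(RegexOptions.Compiled, out _);

        Console.WriteLine($"Interpreted: {interpreted.TotalMilliseconds:F3} ms");
        Console.WriteLine($"Compiled:    {compiled.TotalMilliseconds:F3} ms");
    }
}
```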

There is another approach, and I don’t want to hide it from you: with a few additional lines, you can write a tool that generates a real .NET assembly file containing one or more pre-compiled regular expressions (the Regex.CompileToAssembly API, which is only supported on .NET Framework). This is done at design time. You can then reference this assembly and call your pre-compiled regex methods from it, moving the compilation overhead to the development and build phase. However, you need to build that tool and execute it to generate an additional assembly, which your application’s project then references, before your actual project can be built. As you can see, this is very cumbersome in practice, and to be fair, I have never seen it done in real-world projects. Also, this only works if you have all regex patterns at hand at design time. It can’t work when the regular expressions used in your project are not yet known when the regex assembly is compiled.

Regex optimizations in .NET 7 with source generators

However, if all patterns are known at design time, .NET 7 comes to our rescue. Like the LoggerMessage source generator I described in another article (German), .NET 7 brings us a regular expression source generator, which is used through the GeneratedRegex attribute.

Before we dive into that, though, you should know that almost all of the RegexCompiler was rewritten for .NET 7, so that its output is even more optimized than it was with .NET 6. So even if we don’t use this new feature, chances are that our regular expressions execute a little bit faster just by using .NET 7 instead of an older version of .NET.

But now, let’s look at our new implementation: 

				
    using System;
    using System.Text.RegularExpressions;

    public partial class MyValidator
    {
        [GeneratedRegex(@"\p{Sc}+\s*\d+")]
        private static partial Regex CurrencyRegex();

        public static bool IsValidCurrency(string value)
        {
            return CurrencyRegex().IsMatch(value);
        }
    }

You see that the class has been declared partial. This is (sadly) required for the source generator to work: At compile time, it will create another part of this very class and generate the code into this invisible other partial declaration. 

When we build this, the source generator evaluates the pattern at compile time and generates the implementation of our partial CurrencyRegex() method. There, it instantiates a compiler-generated regex runner and caches the instance in a static variable. In Visual Studio, you can select the partial method declaration and choose “Go to Definition” to view the generated source, or you can use a tool like ILSpy to look at the generated parts in the compiled assembly.

I don’t want to copy the whole method here, just a snippet of it, to show you what it looks like:

				
    private bool TryMatchAtCurrentPosition(ReadOnlySpan<char> inputSpan)
    {
        int pos = runtextpos;
        int matchStart = pos;
        // […]
        for (iteration = 0; (uint)iteration < (uint)slice.Length && char.GetUnicodeCategory(slice[iteration]) == UnicodeCategory.CurrencySymbol; iteration++)
        {
        }
        if (iteration == 0)
        {
            return false;
        }
        slice = slice.Slice(iteration);
        // […]
        while (true)
        {
            // […]
            int iteration2;
            for (iteration2 = 0; (uint)iteration2 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration2]); iteration2++)
            {
            }
            slice = slice.Slice(iteration2);
            pos += iteration2;
            for (iteration3 = 0; (uint)iteration3 < (uint)slice.Length && char.IsDigit(slice[iteration3]); iteration3++)
            {
            }
            // […]
        }
    }

What you can spot in this generated code fragment is the search for a Unicode CurrencySymbol, and a bit further down you’ll see the checks for whitespace and digits. So, this is code generated specifically to match strings against our regular expression pattern as fast and as efficiently as possible, with no added runtime overhead other than JIT-compiling our assembly.
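
To illustrate the idea behind such specialized code, here is a heavily simplified, hand-written matcher for the same pattern. It is not the generated code: the class HandWrittenCurrencyMatcher is made up for this sketch, and it only answers the IsMatch question without tracking match positions. Because the three character classes in the pattern do not overlap, a plain left-to-right scan without backtracking is sufficient here.

```csharp
using System;
using System.Globalization;

// Hypothetical, simplified counterpart of Regex.IsMatch(input, @"\p{Sc}+\s*\d+").
public static class HandWrittenCurrencyMatcher
{
    public static bool IsMatch(ReadOnlySpan<char> input)
    {
        for (int start = 0; start < input.Length; start++)
        {
            int pos = start;

            // \p{Sc}+ : one or more currency symbols
            while (pos < input.Length &&
                   char.GetUnicodeCategory(input[pos]) == UnicodeCategory.CurrencySymbol)
            {
                pos++;
            }

            if (pos == start)
            {
                continue; // no currency symbol here, try the next start position
            }

            // \s* : zero or more whitespace characters
            while (pos < input.Length && char.IsWhiteSpace(input[pos]))
            {
                pos++;
            }

            // \d+ : one digit is already enough for IsMatch to succeed
            if (pos < input.Length && char.IsDigit(input[pos]))
            {
                return true;
            }
        }

        return false;
    }
}
```

A string can be passed directly, for example HandWrittenCurrencyMatcher.IsMatch("$ 100"), because string converts implicitly to ReadOnlySpan<char>.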

Performance tests

In the sample repository for this article series, I prepared a project that uses BenchmarkDotNet to compare the performance of the different approaches. You can check out the repo, go to the regex folder, and execute the benchmarks with dotnet run -c Release -f net7.0. This requires both the .NET 6 and the .NET 7 SDKs to be installed side by side, so that the runner can execute the samples on both runtimes. Be aware that running these benchmarks on .NET 6 and .NET 7 takes about 10 minutes, as every benchmark method is run many times to get results that are as reliable as possible, and this is done with two arguments (one that is a match and one that isn’t). Also be aware that you can’t compare the results from your machine to my measurements, as CPU performance and memory bandwidth vary between machines and affect the numbers. Only compare benchmark values with each other when they come from the same machine.

That said, let’s run the benchmarks and talk numbers.


As the name suggests, New_Instance is the very first interpreted variant, where we create a new instance of the Regex class with our pattern on every call. On .NET 6 this takes about 1.5 milliseconds, and on .NET 7 this has improved to 1.37 ms.

Cached_Instance reuses the same instance for every call, and this takes around 90 ns on .NET 6, with only a marginal improvement of a single nanosecond on .NET 7.

Using the static method (which internally caches the interpreted regex) takes a little longer, 107 ns on .NET 6, which can be explained by the need to look up the regex in the cache before executing it. On .NET 7 this is down to 98 ns, but that is still a bit more than caching the instance yourself.

The really big number is creating a new instance with RegexOptions.Compiled each time we call it, at 1,800 milliseconds, or almost 2 seconds, on .NET 6. This involves emitting the IL code. On .NET 7 this is down to 1.3 seconds, an improvement of half a second per call. So, if you want to use RegexOptions.Compiled, make sure to really cache your instance, because not doing so is awfully expensive.

So, you usually do that initial compilation only once and then cache the compiled regex, which then needs 35 ns to execute on .NET 6 and pretty much the same on .NET 7. So, the first call, even on .NET 7, takes about 1.3 seconds, and all later calls take only about a third of the time needed to match the interpreted way.

The new variant with the regex source generator is only available on .NET 7, so I used compiler directives to replace that call with an empty statement on .NET 6. Therefore, the numbers for .NET 6 are about zero. On .NET 7, each call takes about 35 ns, too, but this variant shaves off the 1.3 seconds for the initial runtime compilation and eliminates the need to cache the compiled instance ourselves, as the source generator already does that for us.
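
Targeting both runtimes from one code base could look roughly like the following sketch. It is not the code from the sample repository: instead of an empty statement, it falls back to a cached RegexOptions.Compiled instance on older targets. The class name CurrencyValidator and the field s_fallback are made up for illustration; NET7_0_OR_GREATER is the preprocessor symbol the .NET SDK defines for net7.0 and later targets.

```csharp
using System.Text.RegularExpressions;

// Hypothetical multi-targeting sketch, not the benchmark code from the repository.
public static partial class CurrencyValidator
{
    private const string Pattern = @"\p{Sc}+\s*\d+";

#if NET7_0_OR_GREATER
    // .NET 7 and later: the implementation is emitted by the regex source generator.
    [GeneratedRegex(Pattern)]
    private static partial Regex CurrencyRegex();
#else
    // Older targets: fall back to a cached instance that is compiled at runtime.
    private static readonly Regex s_fallback = new Regex(Pattern, RegexOptions.Compiled);

    private static Regex CurrencyRegex() => s_fallback;
#endif

    public static bool IsValidCurrency(string value) => CurrencyRegex().IsMatch(value);
}
```

Callers only see IsValidCurrency, so they don’t need to know which variant is active for the current target framework.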

Conclusion

If you know your regular expressions at compile time, you can make use of the all-new .NET 7 GeneratedRegex source generator to create regular expression instances that no longer have any added startup costs. Even if you don’t want to do that or simply can’t, because the expression is only defined dynamically at runtime or isn’t executed very often, you still benefit from the rewritten regex compiler in .NET 7, which may execute your regular expressions a little bit faster than .NET 6.

If you combine that with the JIT compiler improvements we discussed in the first article, you can be sure that .NET 7 tries everything it can, at compile time and at runtime, to make your regex execution as fast as possible while hiding pretty much everything behind the curtain.
