Optimize ASP.NET Core memory with DATAS

.NET 8 introduces a new Garbage Collector feature called DATAS for Server GC mode - let's run some benchmarks and see how it fits into the big picture.


Kenny specializes in the architecture and design of distributed systems based on ASP.NET Core. Furthermore, he likes to deal with .NET internals like memory management/garbage collection, threading and asynchronous programming, and performance optimization.

TL;DR

Maoni Stephens, one of the lead architects of the .NET Garbage Collector (GC), recently published a blog post about a new .NET GC feature called Dynamic Adaptation To Application Sizes (DATAS), which will ship with .NET 8. This feature automatically increases or decreases the number of managed heaps in Server GC mode while the app is running. It reduces the total amount of memory used by your .NET app (by roughly a factor of eight in my tests, on an AMD 16-core processor with Simultaneous Multithreading enabled), making Server GC mode a viable option for memory-constrained environments like Docker containers or Kubernetes pods that have access to several logical CPU cores.

Let's start with a benchmark

When you run an ASP.NET Core application on .NET 7, put some stress on it by allocating objects, and track the Garbage Collector (GC) metrics, you might see something like this:
.NET 7 Server GC run
In the picture above, you can see that we start out at around 80 MB of total memory, most of it attributed to the .NET CLR (the gray area in the diagram representing unmanaged memory). The managed heap is nearly empty because our application just started. Once we call endpoints, objects get allocated in generation 0 of the Small Object Heap (SOH, blue area), and after 1,000 endpoint calls, we also allocate objects greater than 85,000 bytes in size, which are placed on the Large Object Heap (LOH, violet area). We allocate more memory, and around the two-minute mark, the first full compacting GC run occurs. Objects that survive in the SOH are placed in generation 1 (thin red area), while the LOH/POH is simply freed. We then continue allocating and can see that the next full compacting GC runs occur at 3:46, 5:32, and 7:25 minutes, respectively. The red and green areas (generations 1 and 2 of the SOH) stay pretty small because most of our objects are transient.
During the benchmark run, we used up to 390 MB of memory (including unmanaged memory). We could get away with less memory by enabling workstation mode (I’ll show you further down in the article how you can do that). The resulting graph might look something like this:
.NET 7 Workstation GC
The first thing you should notice is the vastly different amount of memory used. We only use ~36 MB of memory at most. After 1:40 minutes, total memory consumption stays stable at around 30 MB. We can also see many more jagged edges in generation 0 (blue area), indicating that compacting GC runs occur more often than in Server GC mode. But why is that?
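If you want to read similar GC metrics from inside your app instead of attaching a profiler, the System.GC class exposes them directly. A minimal sketch (GCMemoryInfo is available since .NET 5):

```csharp
using System;

// Snapshot of the most recent GC: heap size, committed memory,
// and the number of collections per generation so far.
var info = GC.GetGCMemoryInfo();
Console.WriteLine($"Heap size:  {info.HeapSizeBytes / (1024.0 * 1024.0):F1} MB");
Console.WriteLine($"Committed:  {info.TotalCommittedBytes / (1024.0 * 1024.0):F1} MB");
Console.WriteLine($"Gen 0 GCs:  {GC.CollectionCount(0)}");
Console.WriteLine($"Gen 1 GCs:  {GC.CollectionCount(1)}");
Console.WriteLine($"Gen 2 GCs:  {GC.CollectionCount(2)}");
```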

Differences between Server GC mode and Workstation GC mode

Workstation mode was originally designed for client applications. Back in the day, the threads executing app code were halted until a GC run was finished. In desktop apps, you do not want to introduce freezes of several milliseconds or even seconds, so the Workstation GC was tuned to perform runs more frequently and to finish individual runs faster. Since .NET Framework 4.0, we also have background GC runs, which minimize the time threads are blocked.
Server GC, in contrast, was designed to maximize throughput for services that receive short-lived requests over time. GC runs happen less frequently but may take longer. In the end, you spend less time on GC runs and more time in your service code.
The most glaring difference is the following: Workstation GC only uses a single managed heap. A managed heap consists of the following sub-heaps:
  • The Small Object Heap (SOH) with its three generations 0, 1, and 2. Objects smaller than 85,000 bytes are allocated here.
  • The Large Object Heap (LOH) which is used for objects greater than or equal to 85,000 bytes.
  • The Pinned Object Heap (POH) which is mostly used by libraries that perform interop and pin buffers for that (e.g. for networking or other I/O scenarios).
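The 85,000-byte threshold is easy to observe from code: the GC reports LOH objects as belonging to generation 2. A minimal sketch:

```csharp
using System;

// An int is 4 bytes, so 30,000 ints ≈ 120,000 bytes → Large Object Heap.
var largeArray = new int[30_000];
// 1,000 ints ≈ 4,000 bytes → generation 0 of the Small Object Heap.
var smallArray = new int[1_000];

Console.WriteLine(GC.GetGeneration(largeArray)); // prints 2 (LOH)
Console.WriteLine(GC.GetGeneration(smallArray));
```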
In Server GC mode, you will have several of these managed heaps, by default one per logical CPU core, but this can be tuned via GCHeapCount.
The additional number of managed heaps, as well as the fact that GC runs are performed less often, are the main factors explaining why memory consumption is much higher in Server GC mode.
But what if you want to benefit from Server GC mode while also dynamically adjusting the number of managed heaps at runtime? A typical scenario would be a service that runs in the cloud and must handle a lot of requests at certain burst times, but afterwards should scale down to reduce memory consumption. Up until now, there was no way to achieve that except by restarting the service with different configuration values. Scaling up would also require a restart – thus many dev teams just tried to find a compromise via the GCHeapCount and ConserveMemory options.
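The GCHeapCount and ConserveMemory knobs mentioned above correspond to the System.GC.HeapCount and System.GC.ConserveMemory runtime settings. A sketch of such a compromise in runtimeconfig.json – the concrete values 4 and 5 are just placeholders that you would have to tune for your own workload:

```json
"configProperties": {
    "System.GC.Server": true,
    "System.GC.HeapCount": 4,
    "System.GC.ConserveMemory": 5
}
```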

And then along comes DATAS

This is where a new feature called Dynamic Adaptation To Application Sizes (DATAS) comes into play. It will be available with .NET 8, and you can already try it out since preview 7 or in the current RC 1. The results of the same benchmark with DATAS enabled look like this:
The important thing to note here: although we are running in Server GC mode, our process only used 48 MB of total memory at maximum with DATAS activated (you cannot use it in Workstation mode). GC runs occur more often than in the first diagram, and we can see a ramp-up in the beginning and a ramp-down at the 3:40 mark, indicating a change in the number of managed heaps. In the end, this is approximately eight times less than the 390 MB of total memory in Server GC mode on .NET 7.
DATAS will operate in the following way during runtime:
  1. The GC will start with only a single managed heap.
  2. Based on a metric called “throughput cost percentage”, the GC will decide whether it is viable to increase the number of managed heaps. This will be evaluated on every third GC run.
  3. There is also a metric called “space cost” which the GC uses to decide whether the number of managed heaps should be reduced.
  4. If the GC decides to increase or decrease the number of managed heaps, it will block your threads (similarly to a compacting GC run) and create or remove the managed heap(s). Corresponding memory regions will be moved. The switch from segments to regions in .NET 6 and .NET 7 when it comes to the internal organization of memory within a managed heap makes this scenario possible to implement.
By the way: DATAS will not be available for .NET Framework 4.x, only for .NET 8 or later.

Benefits and drawbacks?

DATAS will allow you to use Server GC mode in memory-constrained environments, for example in Docker containers, Kubernetes pods, or Azure App Service. During bursts where your service is hit with a lot of requests, the GC will dynamically increase the number of managed heaps to benefit from the throughput-optimized settings of Server GC. When the burst is over, the GC will reduce the number of managed heaps again, thus reducing the total amount of memory used by your app. Even during bursts, the GC might settle on fewer managed heaps than one per logical CPU core, so you might end up with your app using less memory in total, without having to configure the number of managed heaps manually.
Please keep in mind: when your app only has a single logical CPU core available, you should always use Workstation GC mode. Server GC mode is only beneficial when your app has two or more cores available. Also, I would recommend verifying that you actually require Server GC mode. Use tools like k6 or NBomber to measure the throughput of your web app. If you designed the memory usage of your app carefully, you might see no difference in throughput at all. Always remember: the .NET GC only performs its runs when you allocate memory.

How to try it out

To try out DATAS, you need to install the .NET 8 SDK (at least preview 7), create a .NET 8 app (e.g. ASP.NET Core), and then add the following two properties to your .csproj file:

<PropertyGroup>
    <ServerGarbageCollection>true</ServerGarbageCollection>
    <GarbageCollectionAdaptationMode>1</GarbageCollectionAdaptationMode>
</PropertyGroup>

You can also specify it via command-line arguments when building your project:

dotnet build /p:ServerGarbageCollection=true /p:GarbageCollectionAdaptationMode=1

Or in runtimeconfig.json:

"configProperties": {
    "System.GC.Server": true,
    "System.GC.DynamicAdaptationMode": 1
}

Or via environment variables:

set DOTNET_gcServer=1
set DOTNET_GCDynamicAdaptationMode=1
Please keep in mind: you must not set the GCHeapCount option when using one of the methods above. If you do, the GC will just use the specified number of heaps and not activate DATAS.
Also important: if you want to run in Workstation mode, simply set ServerGarbageCollection or the corresponding config property/environment variable to false or zero, respectively.
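Opting into Workstation GC explicitly looks like this in your .csproj:

```xml
<PropertyGroup>
    <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>
```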
But this leaves one question: what if you do not specify any of these options?

Which GC mode will my ASP.NET Core app use by default?

You can substitute this question with another one: how many logical CPU cores can your ASP.NET Core app access? If it is fewer than two, it will use Workstation GC mode. Otherwise, Server GC mode is activated by default. So be particularly careful when you specify the resource constraints for your app in Docker, Kubernetes, or cloud environments – you might suddenly end up in another GC mode, taking up more memory than expected.
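If you are unsure which mode your app ended up in, you can log it at startup via GCSettings – a minimal sketch:

```csharp
using System;
using System.Runtime;

// GCSettings.IsServerGC tells you which GC mode the runtime picked,
// Environment.ProcessorCount how many logical cores the app can access.
Console.WriteLine($"Server GC:     {GCSettings.IsServerGC}");
Console.WriteLine($"Latency mode:  {GCSettings.LatencyMode}");
Console.WriteLine($"Logical cores: {Environment.ProcessorCount}");
```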
 

Discussion, Conclusion, and Outlook

In my opinion, DATAS is a great new feature which brings the benefits of Workstation GC and Server GC together: you start out with less memory and when a burst of requests comes in, the GC can dynamically scale its number of managed heaps up to improve throughput. When the number of requests decreases at some later point in time, the number of managed heaps can be decreased, too, to free up memory.

But the devil is in the details: when tracing ETW events with PerfView, the reported number of heaps in my benchmarks was always 1 – I will take a look at the official ASP.NET Core benchmarks to see how they traced the exact number of managed heaps. Another important aspect is the decision whether a scale-up or scale-down is performed: this happens on every third run of the GC, and normally a GC run is only triggered when memory is allocated and the allocation contexts of the threads do not have enough memory left. What if suddenly no allocations are performed (because no requests are incoming)? Will the number of heaps not decrease? And finally, we saw interesting behavior when it comes to the number of GC runs with DATAS enabled: they were triggered significantly more often than in regular Server GC mode – how exactly does the number of GC runs relate to the number of managed heaps?

In the end, DATAS will probably be handled in a similar way to the regions feature: it was introduced in .NET 6, but only activated by default in .NET 7. I would expect that in .NET 8, you have to manually opt in to this feature, while in .NET 9, it might be on by default. We will see what time brings.

 

Appendix: About the benchmarks

The code that was used to produce the graphs above is a simple ASP.NET Core Minimal API with a single endpoint, which looks like this:
					using System.Threading;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

namespace WebApp;

public static class Endpoint
{
    private static ulong _numberOfCalls;
    private static int[]? _currentArray;

    public static void MapEndpoint(this WebApplication app)
    {
        app.MapGet(
            "/api/call",
            () =>
            {
                var numberOfCalls = Interlocked.Increment(ref _numberOfCalls);
                if (numberOfCalls != 0 && numberOfCalls % 1000 == 0)
                {
                    var largeArray = new int[30_000];
                    Interlocked.Exchange(ref _currentArray, largeArray);
                }

                return Results.Ok(new NumberOfCallsDto(numberOfCalls));
            }
        );
    }
}

public sealed record NumberOfCallsDto(ulong NumberOfCalls);
When the endpoint is called, the _numberOfCalls static field is incremented using the lock-free Interlocked.Increment method to avoid concurrency issues (several requests hitting the endpoint at once). Every 1,000th call, a new large array is allocated on the LOH, and the reference to the previous array is exchanged with the new one (see the violet area in the diagrams). Also, every call allocates a single NumberOfCallsDto on the SOH (blue, red, and green areas in the diagram). Of course, there is the additional overhead of everything that ASP.NET Core allocates for an HTTP request, like a DI container scope, the HttpContext instance and all objects that it references, etc.
This endpoint is then called via NBomber, a load testing tool for .NET. The client looks like this:
					using System;
using System.Net.Http;
using NBomber.CSharp;
using NBomber.Http.CSharp;

namespace BomberClient;

public static class Program
{
    public static void Main()
    {
        const int numberOfCallsPerInterval = 300;
        var interval = TimeSpan.FromSeconds(1);

        using var httpClient = new HttpClient();
        var scenario =
            Scenario
               .Create(
                    "bomb_web_app",
                    async _ =>
                    {
                        var request = Http.CreateRequest("GET", "http://localhost:5000/api/call");

                        // ReSharper disable once AccessToDisposedClosure
                        // HttpClient will not be disposed when this lambda is called
                        return await Http.Send(httpClient, request);
                    })
               .WithoutWarmUp()
               .WithLoadSimulations(
                    Simulation.RampingInject(numberOfCallsPerInterval, interval, TimeSpan.FromSeconds(20)),
                    Simulation.Inject(numberOfCallsPerInterval, interval, TimeSpan.FromMinutes(7)),
                    Simulation.RampingInject(0, interval, TimeSpan.FromSeconds(10))
                );

        NBomberRunner.RegisterScenarios(scenario).Run();
    }
}
Here, we instantiate a scenario that ramps up to 300 calls per second within 20 seconds, stays at this rate for 7 minutes, and then ramps down to zero calls per second within 10 seconds. Requests are sent to the endpoint using the NBomber.Http package.
The source code for this post can be found here.
The tests were executed on the following machine:
  • AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
  • 64 GB DDR4-3400 RAM Dual Channel 16-16-16-36
  • Windows 11 Pro 22621.2134
Performance data was captured with JetBrains dotMemory 2023.2.1 and PerfView 3.1.5.