Scaling a notification service – Part 1

A reader reached out and asked for advice on how to create a notification service. In this series of posts, I will try to answer that question and explain the reasoning behind the different approaches. We start with a simple approach, then work to scale the service and send as many notifications as we can.

The code is hosted here: PavlovicDzFilip/scalable-notification-service (github.com)

The solutions will be judged on:

  • Simplicity – how hard is it for developers to understand?
  • Scalability – how many notifications can we send?
  • Infrastructural complexity – do we need additional services, and what complexity do they bring?

Notes:

  • Sending a notification takes 10 ms. When you are hosting on AWS, this is more than enough time to call Amazon Simple Email Service.
  • After sending a notification, an entry is added to the NotificationLog.
  • To send a notification, we need only a small amount of information, which fits into a string of at most 1000 characters.
  • When testing throughput, we run the service for one minute, and log how many messages we have processed in that time.
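
As a minimal sketch of that last note, a timed run could look like the code below. The names ThroughputTest and Measure are hypothetical and not taken from the repository; the only assumption is that the sender exposes a method that accepts a CancellationToken and returns the number of processed messages.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ThroughputTest
{
    // Runs `send` until it completes or `duration` has passed,
    // then reports how many messages it processed and how long it ran.
    public static async Task<(int Processed, TimeSpan Elapsed)> Measure(
        Func<CancellationToken, Task<int>> send,
        TimeSpan duration)
    {
        // The token is cancelled automatically once `duration` has elapsed.
        using var cts = new CancellationTokenSource(duration);
        var stopwatch = Stopwatch.StartNew();

        var processed = await send(cts.Token);

        stopwatch.Stop();
        return (processed, stopwatch.Elapsed);
    }
}
```

A call such as `await ThroughputTest.Measure(sender.Send, TimeSpan.FromMinutes(1))` would then cover the one-minute window, provided the sender stops cooperatively when the token is cancelled instead of throwing.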

Seeding data

The script used to fill in the database is:

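-- Target size; the doubling loop below may overshoot it.
declare @minimumRowsToHave int = 1000000;
declare @count int = (select COUNT(0) from Notifications);

-- Seed a single row so there is something to copy.
-- GETUTCDATE() keeps SendDate comparable with DateTime.UtcNow in the service.
if @count = 0
begin
	insert into Notifications (Id, Payload, SendDate) values (1, '{}', GETUTCDATE());
end

-- Each pass copies every existing row with a shifted Id, doubling the table.
while @count < @minimumRowsToHave
begin
	declare @max int = (select MAX(Id) from Notifications);

	insert into Notifications (Id, Payload, SendDate)
	select Id + @max, Payload, SendDate from Notifications;

	set @count = (select COUNT(0) from Notifications);
end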

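Inserting the data with INSERT INTO ... SELECT is far faster than inserting rows one by one in a loop: every pass copies the whole table in a single statement, so the number of rows doubles each time.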
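
To get a feel for how few passes the seeding script needs, here is a small sketch of the arithmetic. It assumes the table starts empty, so the loop begins with the single seeded row; SeedEstimate and PassesToReach are names made up for this illustration.

```csharp
using System;

public static class SeedEstimate
{
    // Counts how many doubling passes are needed to grow from one row
    // to at least `minimumRows` rows, and how many rows exist afterwards.
    public static (int Passes, long Rows) PassesToReach(long minimumRows)
    {
        long rows = 1;
        var passes = 0;
        while (rows < minimumRows)
        {
            rows *= 2; // one INSERT INTO ... SELECT copies every existing row
            passes++;
        }

        return (passes, rows);
    }
}
```

`SeedEstimate.PassesToReach(1_000_000)` returns 20 passes and 1,048,576 rows, so the table ends up slightly larger than the requested minimum.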

Solution 1 – Simple solution

Always start with a simple solution. It might take you just where you need to go. The simplest thing that just might work for your case is to loop through the records and handle them one by one.

On my PC, this approach sends 2301 notifications per minute.

The code:

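using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

public class SimpleNotificationSender(
    IServiceProvider serviceProvider,
    IEmailService emailService)
{
    // Sends due notifications one by one until none are left or
    // cancellation is requested. Returns the number actually sent.
    public async Task<int> Send(CancellationToken cancellationToken)
    {
        var totalNotificationsSent = 0;
        bool hasMoreNotifications;
        do
        {
            hasMoreNotifications = await TrySendNext();
            if (hasMoreNotifications)
            {
                // Count only successful sends, so the total is also
                // correct when the loop ends because of cancellation.
                totalNotificationsSent++;
            }
            // The token is checked only between iterations, so a notification
            // that is in progress is always sent and logged.
        } while (hasMoreNotifications && !cancellationToken.IsCancellationRequested);

        return totalNotificationsSent;
    }

    // Handles a single due notification. Returns false when nothing is due.
    private async Task<bool> TrySendNext()
    {
        // A fresh scope (and DbContext) per notification keeps the change tracker small.
        await using var scope = serviceProvider.CreateAsyncScope();
        await using var dbContext = scope.ServiceProvider.GetRequiredService<NotificationContext>();
        var notification = await dbContext.Notifications.FirstOrDefaultAsync(x => x.SendDate < DateTime.UtcNow);
        if (notification is null)
        {
            return false;
        }

        // The email goes out before the database is updated, so a failure
        // in SaveChangesAsync means the notification is sent again later.
        await emailService.Send(notification.Payload);
        dbContext.Notifications.Remove(notification);
        dbContext.NotificationLogs.Add(new NotificationLog
        {
            Id = notification.Id,
            SentDate = DateTime.UtcNow
        });

        // The delete and the log entry are saved together in one SaveChangesAsync call.
        await dbContext.SaveChangesAsync();
        return true;
    }
}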

Judging the solution

  • Simplicity – This is as simple as it gets, 5/5
  • Scalability – 2301 per minute, 1/5. It is not possible to horizontally scale the solution.
  • Infrastructural complexity – It is as simple as hosting a new service. The service has to run as a single instance, because no synchronization mechanism is implemented: two instances could pick up the same notification and send it twice. For the same reason, the code can't simply be added to an existing service, because it will not run correctly if that service is already scaled out, 4/5
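
To illustrate why a second instance is a problem, the sketch below replays one unlucky interleaving of two instances step by step. It is a simplified in-memory model, not the real service: the list `pending` stands in for the Notifications table, and DoubleSendDemo is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DoubleSendDemo
{
    // Returns the ids of the notifications that were "sent".
    public static List<int> Run()
    {
        var pending = new List<int> { 42 }; // stands in for the Notifications table
        var sent = new List<int>();         // stands in for the emails that went out

        // Both instances query before either one has deleted the row.
        var pickedByA = pending.First();
        var pickedByB = pending.First();

        sent.Add(pickedByA); // instance A sends
        sent.Add(pickedByB); // instance B sends the same notification again

        pending.Remove(pickedByA);
        pending.Remove(pickedByB); // nothing is left to remove

        return sent;
    }
}
```

`DoubleSendDemo.Run()` returns a list that contains the id 42 twice: the recipient gets the same notification two times. In the real service the second instance would most likely fail when saving, but by then its email has already gone out.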

Understanding the bottleneck

So far, the bottleneck can be in one of these areas:

  • Database
  • CPU
  • The code

Every iteration takes 26 milliseconds on average (60,000 ms divided by 2301 notifications). However, average CPU utilization is only 0.96%, which is almost idle. Most of each iteration is spent awaiting I/O, namely the email call and the database round trips, and the CPU does no work during that time. The CPU is not the bottleneck.

Regarding the database, it handles the load so far with no issues. This is a truly minimal load for any database engine.

The bottleneck is our code: it takes 26 ms to handle a single message, and the next message is not started until the previous one is finished, even though the CPU is barely used. This means we can do the work in parallel by starting multiple threads at the same time, and use the remaining CPU time.
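
The numbers behind this reasoning can be sketched as follows. BottleneckEstimate and its members are hypothetical names for this illustration, and the parallel figure is an idealized upper bound that assumes the database and the email service keep up with the extra load.

```csharp
using System;

public static class BottleneckEstimate
{
    public const double TestWindowMs = 60_000; // one-minute test run
    public const double SendMs = 10;           // stated cost of sending one notification

    // Average wall-clock time spent per notification for a measured throughput.
    public static double MsPerNotification(int sentPerMinute) =>
        TestWindowMs / sentPerMinute;

    // Time per notification that is not spent in the send call itself.
    public static double NonSendMs(int sentPerMinute) =>
        MsPerNotification(sentPerMinute) - SendMs;

    // Idealized throughput if `workers` independent loops ran side by side.
    public static double IdealPerMinute(int sentPerMinute, int workers) =>
        (double)sentPerMinute * workers;
}
```

`BottleneckEstimate.MsPerNotification(2301)` comes out at roughly 26.08 ms, and `NonSendMs(2301)` at roughly 16 ms, which presumably goes to the database round trips and general overhead. Under the stated assumption, `IdealPerMinute(2301, 8)` gives 18,408 notifications per minute for eight parallel loops. Real numbers will be lower once contention appears.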

Stay tuned for the next post, where we will introduce multithreading, start utilizing all CPU cores, and make the solution horizontally scalable.