kaiuk Sep 10 2019 at 14:27

Enumerable: How to yield a business value

This article briefly explains how a common language keyword can affect the IT infrastructure budget of a project, help it fit within the limits of a hosting environment, and serve as a good sign of the quality and maturity of the source code.

The examples use C#, but most of the ideas carry over to other languages.

Of all the language's features, 'yield' is, in my opinion, the most undervalued keyword. You can read the documentation and find plenty of examples on the Internet. In short, 'yield' lets you create iterators implicitly. By design, an iterator exposes an IEnumerable sequence for 'public' use. And here the tricky part starts, because the language has many implementations of IEnumerable: List, Dictionary, HashSet, Queue, and so on. In my experience, the implementation chosen to meet a business requirement is often the wrong one. The problem is aggravated by the fact that, whichever implementation is chosen, the program 'just works', and that is what the business really needs, isn't it? Usually it does work, but only until the service is deployed to a production environment.
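To make "creating iterators implicitly" concrete, here is a minimal sketch. The names YieldDemo, Numbers, and Produced are hypothetical and exist only for this illustration; the counter is there to show that the method body does not run until someone enumerates the result.

```csharp
using System;
using System.Collections.Generic;

public static class YieldDemo
{
    // Hypothetical counter: how many values the iterator has produced so far.
    public static int Produced;

    // The compiler turns this method into a hidden iterator class.
    // Calling Numbers() only creates that object; the loop body runs
    // step by step as the caller asks for the next element.
    public static IEnumerable<int> Numbers(int count)
    {
        for (var i = 1; i <= count; i++)
        {
            Produced++;
            yield return i;
        }
    }
}
```

After `var numbers = YieldDemo.Numbers(3);` the counter is still 0. It reaches 3 only after a 'foreach' over `numbers` has run to the end.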

To demonstrate the problem, I suggest a business case that is very common in enterprise projects. We can extend it over the course of the article and swap out parts of the flow to see how far this approach reaches in an enterprise project. It should also help you find and fix your own case.

Example of the task:

1. Load a set of records line by line from a file or DB into memory.
2. For each column of the record, change the value to some other value.
3. Save the results of the transformation into a file or DB.

Let's consider where this logic may apply. At the moment, I see two cases:

It may be part of the flow of a console ETL application.
It may be the logic inside an action of a Controller in an MVC application.

Paraphrased in more technical terms, the task sounds like this: "(1) allocate an amount of memory, (2) load information into memory from persistent storage, (3) modify it, and (4) flush the changed records from memory back to persistent storage." The first phrase, "(1) allocate an amount of memory", may relate directly to your non-functional requirements. Your job or service has to 'live' in a hosting environment that may impose limits (for instance, 150 MB per microservice), and to budget for the service we have to predict how much memory it will use (usually we talk about the maximum amount). In other words, we have to determine the memory 'footprint' of the service.

Let's consider the memory footprint of a really common implementation that I see from time to time in the codebases of enterprise projects. You can try to find it in your own projects too, for example 'under the hood' of a 'repository' pattern implementation: just search for words such as 'ToList', 'ToArray', 'ToReadonlyCollection', and so on. Every such implementation means that:

1. For each line/record in the file/DB, memory is allocated to hold the properties of the record (e.g. var user = new User() { FirstName = "Test", LastName = "Test2" }).

2. Next, with the help of 'ToArray', for example, or manually, the object references are stored in a collection (e.g. var users = new List<User>(); users.Add(user)). So memory is allocated for every record in the file, and, because the collection holds a reference to each object, none of that memory can be released while the collection is alive.

Here is an example:

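// Requires: using System.Collections.Generic; using System.IO; using System.Linq;

// Variant 1: fill a List<User> manually.
private static IEnumerable<User> LoadUsers2()
{
    var list = new List<User>();
    foreach (var line in File.ReadLines("text.txt"))
    {
        var splitLine = line.Split(';');

        list.Add(new User()
        {
            FirstName = splitLine[0],
            LastName = splitLine[1]
        });
    }

    // Every record in the file is now held in memory at the same time.
    return list;
}

// Variant 2: the same thing with LINQ; ToArray() materializes every record.
private static IEnumerable<User> LoadUsers3()
{
    return File.ReadLines("text.txt")
        .Select(line => line.Split(';'))
        .Select(splitLine => new User()
        {
            FirstName = splitLine[0],
            LastName = splitLine[1]
        }).ToArray();
}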

Memory profiler results:

This is exactly the picture I saw every time in the production environment before a container stopped or restarted because of the hosting resource limit per container.

So the footprint in this case depends, roughly, on the number of records in the file, because memory is allocated per record. The sum of these small pieces of memory gives us the maximum amount of memory that our service may consume: this is the footprint of the service. But is this footprint predictable? Apparently not, because we cannot predict the number of records in the file. And in many cases the file is several times larger than the memory the hosting allows. This means that such an implementation is hard to use in a production environment.

It looks like it is time to rethink this implementation. The following assumption gives us a better way to calculate the footprint of the service: «the footprint should depend on the size of only ONE record in the file». Roughly, in this case, we can take the maximum size of each column of a single record and sum them. It is much easier to predict the size of one record than the number of records in the file.
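The "sum the maximum column sizes" estimate can be sketched as follows. This is a rough illustration rather than a precise model of the runtime: RecordFootprint and its parameters are hypothetical names, and the per-object overhead is an assumed value that you should measure with a memory profiler. The only fixed fact used here is that .NET strings store text as UTF-16, two bytes per char.

```csharp
public static class RecordFootprint
{
    // .NET strings store text as UTF-16, so each char takes 2 bytes.
    private const int BytesPerChar = 2;

    // Rough upper bound for one record: the record object itself plus,
    // for every string column, its maximum length in bytes and an
    // assumed fixed per-object overhead (a guess, not a measured value).
    public static long EstimateBytes(int[] maxColumnLengths, int assumedOverheadPerObject)
    {
        long total = assumedOverheadPerObject; // the record object itself
        foreach (var length in maxColumnLengths)
        {
            total += assumedOverheadPerObject + (long)length * BytesPerChar;
        }
        return total;
    }
}
```

For example, with FirstName limited to 50 characters, LastName to 100 characters, and an assumed overhead of 32 bytes per object, `RecordFootprint.EstimateBytes(new[] { 50, 100 }, 32)` gives 396 bytes per record, regardless of how many records the file contains.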

And it is really remarkable that, with the help of just one keyword, 'yield', we can implement a service that handles an unpredictable number of records while constantly consuming only a couple of megabytes.

Time for an example:


using System.Collections.Generic;
using System.IO;

class Program
{
	static void Main(string[] args)
	{
		// 1. Load a set of records line by line from a file or DB into memory.
		//    Nothing is read yet: the iterator runs only when it is enumerated.
		var users = LoadUsers();

		// 2. For each column of the record, change the value to some other value.
		users = ModifyFirstName(users);

		// 3. Save the results of the transformation into a file or DB.
		//    This loop drives the whole pipeline, one record at a time.
		SaveUsers(users);
	}

	private static IEnumerable<User> LoadUsers()
	{
		// File.ReadLines is itself lazy: it reads the file line by line.
		foreach (var line in File.ReadLines("text.txt"))
		{
			var splitLine = line.Split(';');

			yield return new User()
			{
				FirstName = splitLine[0],
				LastName = splitLine[1]
			};
		}
	}

	private static IEnumerable<User> ModifyFirstName(IEnumerable<User> users)
	{
		foreach (var user in users)
		{
			user.FirstName += "_1";

			yield return user;
		}
	}

	private static void SaveUsers(IEnumerable<User> users)
	{
		foreach (var user in users)
		{
			File.AppendAllLines("results.txt", new string[] { user.FirstName + ';' + user.LastName });
		}
	}

	private class User
	{
		public string FirstName { get; set; }

		public string LastName { get; set; }
	}
}

As you can see in the example above, memory is allocated for only one object at a time ('yield return new User()') instead of creating a collection and filling it with objects. This is the main point of the optimization, and it makes the memory footprint of the service much more predictable, because we only need to know the size of two fields, in our case FirstName and LastName. Once a modified user has been saved to the file (see File.AppendAllLines) and the 'foreach' in LoadUsers moves on to the next iteration, nothing references that user instance any more, so it becomes eligible for garbage collection and the next instance can be created. The garbage collector reclaims the memory at some later point rather than immediately, but roughly speaking, each iteration replaces one record's worth of memory with the same amount. That is why we need little more memory than the size of a single record in the file.
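The record-by-record behaviour described above can be observed directly. The following sketch is a simplified stand-in for the pipeline, not the author's code: it works on in-memory strings instead of a file, and PipelineOrderDemo, Log, and Run are hypothetical names. Each stage writes to a log so that the order of execution becomes visible.

```csharp
using System.Collections.Generic;

public static class PipelineOrderDemo
{
    // Records which stage touched which record, in order of execution.
    public static readonly List<string> Log = new List<string>();

    private static IEnumerable<string> Load(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            Log.Add("load " + line);
            yield return line;
        }
    }

    private static IEnumerable<string> Modify(IEnumerable<string> names)
    {
        foreach (var name in names)
        {
            Log.Add("modify " + name);
            yield return name + "_1";
        }
    }

    private static void Save(IEnumerable<string> names)
    {
        // This loop pulls one element at a time through Modify and Load.
        foreach (var name in names)
        {
            Log.Add("save " + name);
        }
    }

    public static void Run(IEnumerable<string> lines)
    {
        Log.Clear();
        Save(Modify(Load(lines)));
    }
}
```

Calling `PipelineOrderDemo.Run(new[] { "a", "b" })` leaves the log as: load a, modify a, save a_1, load b, modify b, save b_1. The second record is not loaded until the first one has been saved, which is why only one record needs to be alive at a time.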

Memory profiler results after optimization:

From another perspective, if we rename a couple of methods in the implementation above, you can recognize logic that is meaningful for a controller in an MVC application:


private static void GetUsersAction()
{
    // 1. Load a set of records line by line from a file or DB into memory.
    var users = LoadUsers();
    // 2. For each column of the record, change the value to some other value.
    var usersDTOs = MapToDTO(users);
    // 3. Save the results of the transformation into a file or DB
    //    (here: write them to the response).
    OkResult(usersDTOs);
}

One important note before the code listing: most of the important libraries, such as EntityFramework, ASP.NET MVC, AutoMapper, Dapper, NHibernate, ADO.NET, and so on, expose or consume IEnumerable sources. In the example above this means that LoadUsers may be replaced by an implementation that uses EntityFramework, for example, and loads data row by row from a DB table instead of a file. MapToDTO may be replaced by AutoMapper, and OkResult may be replaced by a 'real' implementation of IActionResult in some MVC framework or by our own implementation based on a network stream, for example:


private static void OkResult(IEnumerable<User> users)
{
    // A network stream implementation can be used here instead of a file.
    using (StreamWriter sw = new StreamWriter("result.txt"))
    {
        foreach (var user in users)
        {
            sw.WriteLine(user.FirstName + ';' + user.LastName);
        }
    }
}
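The note above about replacing LoadUsers with a database-backed implementation can be sketched with plain ADO.NET interfaces. This is only an illustration under stated assumptions: UserRow and UserReader are hypothetical names, and the column order (first name in column 0, last name in column 1) is assumed, not taken from any real schema.

```csharp
using System.Collections.Generic;
using System.Data;

public class UserRow
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class UserReader
{
    // Streams rows one at a time from any ADO.NET data reader.
    // Assumption: column 0 holds the first name, column 1 the last name.
    public static IEnumerable<UserRow> ReadUsers(IDataReader reader)
    {
        while (reader.Read())
        {
            // Only the current row is materialized as an object.
            yield return new UserRow
            {
                FirstName = reader.GetString(0),
                LastName = reader.GetString(1)
            };
        }
    }
}
```

In a real service the reader would typically come from a command's ExecuteReader call. Because the rows are produced lazily, the connection and the reader have to stay open until the caller has finished enumerating the sequence.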

This 'MVC-like' example shows that we can still predict and calculate a memory footprint for a web application too. In this case, though, it also depends on the number of requests. For example, a non-functional requirement may sound like this: «Maximum memory for 1000 requests: no more than 200 KB per user object x 1000 requests ~ 200 MB».

Such calculations are very useful for capacity planning when you scale the web application. For instance, suppose you need to scale it to 100 containers/VMs. To decide how many resources to request from the hosting provider, you can adjust the formula like this: 200 KB per user object x 1000 requests x 100 VMs ~ 20 GB. Moreover, this is the maximum amount of memory, and that amount is under the control of your project's budget.

I hope the information in this article will be helpful and will save a lot of money and time in your projects.
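The two formulas above are simple enough to keep next to the service as a small helper. MemoryBudget and MaxKilobytes are hypothetical names, and the sketch assumes that the 1000 requests are in flight at the same time, each holding one record.

```csharp
public static class MemoryBudget
{
    // Worst-case memory in kilobytes, assuming each concurrent request
    // holds exactly one record in memory on each instance.
    public static long MaxKilobytes(long recordKb, long concurrentRequests, long instances)
    {
        return recordKb * concurrentRequests * instances;
    }
}
```

`MemoryBudget.MaxKilobytes(200, 1000, 1)` returns 200000 KB, roughly 200 MB for a single instance, and `MemoryBudget.MaxKilobytes(200, 1000, 100)` returns 20000000 KB, roughly 20 GB for 100 VMs.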
