Improving Performance of BulkMerge with WhereBulkContains

One of our clients reported a performance issue when using our BulkMerge method with a custom key.

context.BulkMerge(importPickDatas, options => {
	options.ColumnPrimaryKeyExpression = x => new
	{
		x.FacilityID,
		x.CustomerID,
		x.PickupDate,
		x.OrderID,
		x.ProductNumber,
		x.Location
	};
});

In his case, he had to merge over 1 million rows into a table containing over 1 billion rows, and the process took more than 30 minutes to complete.

He asked us to look at his scenario and provide some guidance or recommendations about optimizing this process to reduce the overall execution time. After a few back-and-forth exchanges to better understand his scenario, we found the cause and two solutions.

Cause

SQL Server used the following index to resolve the custom key:

CREATE NONCLUSTERED INDEX [IX_PickData] ON [dbo].[PickData]
(
    [FacilityID] ASC,
    [CustomerID] ASC,
    [PickupDate] ASC
)

However, we can notice that the following columns used in the custom key were missing from the index:

  • OrderID
  • ProductNumber
  • Location

When this happens, in addition to using the index, SQL Server has to perform lookups on the clustered index to retrieve the additional columns, which worsens performance and increases CPU usage. The impact in his case was huge.

It would have been fine if this only happened once, but in his case, it had to happen over 100 times. By default, a batch size of 10,000 rows is used for SQL Server, which means the query generated by the BulkMerge method had to be executed 100 times to merge the 1,000,000 rows.
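
For reference, the batch size can be tuned through the BatchSize option. Here is a minimal sketch that simply sets the option to the SQL Server default of 10,000 rows; the right value depends on the workload, so treat this as an illustration rather than a recommendation:

context.BulkMerge(importPickDatas, options => {
	// 10,000 is the default batch size for SQL Server; each batch
	// executes the generated merge query once.
	options.BatchSize = 10000;

	options.ColumnPrimaryKeyExpression = x => new
	{
		x.FacilityID,
		x.CustomerID,
		x.PickupDate,
		x.OrderID,
		x.ProductNumber,
		x.Location
	};
});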

Solution 1 - Covering Index

Our first reaction was to propose that our client create a new covering index for faster query execution.

A covering index is an index that includes all the columns required to execute a specific query. These columns can be part of:

  • The indexed columns: These are the columns on which the index is primarily sorted. They are typically used in the query WHERE, JOIN, ORDER BY, or GROUP BY clauses.
  • The INCLUDE section: These are additional columns stored in the index but not used for sorting. They are typically columns listed in the SELECT clause of a query that are retrieved without being used for filtering or sorting (although they still can be).

So we proposed that he try the following covering index:

CREATE NONCLUSTERED INDEX [IX_PickData_Covered] ON [dbo].[PickData]
(
    [FacilityID] ASC,
    [CustomerID] ASC,
    [PickupDate] ASC
)
INCLUDE([OrderID], [ProductNumber], [Location])

A covering index comes with some benefits, such as:

  • Performance Boost: Including more columns in the nonclustered index can result in fewer lookups to the clustered index or table, reducing the overall I/O cost.
  • Reduced CPU Utilization: Since all the necessary columns are present in the nonclustered index, SQL Server has to do less work, potentially reducing CPU time.

But also some cons:

  • Covering Index Trap: Implementing this solution once often leads to overusing it for other problems that could be solved in better ways.
  • Disk Space: Adding more columns to an index increases its size, which requires more disk space.
  • Slower Save Operations: Adding more indexes directly impacts the performance of inserting, updating, and deleting data.

Our client implemented this solution and saw some performance improvements. However, he asked us whether it would be possible to improve the performance even further, leading us to solution 2.

Solution 2 - WhereBulkContains

One question we asked our client was: "What percentage of the importation would usually lead to an update?". His answer: over 95% of the data would be updated during the BulkMerge.

With this information, we quickly jumped to a solution: find the natural primary key with the help of the WhereBulkContains method, then use:

  • BulkUpdate for data we know already exists in the database.
  • BulkInsert or BulkMerge for data that we know doesn't already exist:
    • BulkInsert: The BulkInsert can be used directly if we know that no other process can insert the same data.
    • BulkMerge: The BulkMerge needs to be used if another process could also add this data.

In his case, only this process could insert those rows into the table, which was the best scenario, allowing us to propose a solution with BulkInsert for data that did not exist.

So, we proposed the following solution:

  1. Retrieve all existing data using the WhereBulkContains methods.
  2. Create a dictionary using a combination of columns to create a custom key.
  3. Find all entities to be inserted using the dictionary.
  4. Find all entities to be updated using the dictionary (or excluding inserted entities from the original list).
  5. Insert data with BulkInsert.
  6. Copy the id retrieved previously for the entity to update.
  7. Update data with BulkUpdate using the natural primary key instead of a custom key.

We provided him with the following code:

var importPickDatas = new List<PickData>();
importPickDatas.Add(new PickData() { FacilityID = 1, CustomerID = 1, PickupDate = DateTime.Today, OrderID = 1, ProductNumber = "ProductNumber_1", Location = "Location_1", ColumnToModify = "UpdatedColumn_1" });
importPickDatas.Add(new PickData() { FacilityID = 2, CustomerID = 2, PickupDate = DateTime.Today, OrderID = 2, ProductNumber = "ProductNumber_2", Location = "Location_2", ColumnToModify = "UpdatedColumn_2"  });
importPickDatas.Add(new PickData() { FacilityID = 3, CustomerID = 3, PickupDate = DateTime.Today, OrderID = 3, ProductNumber = "ProductNumber_3", Location = "Location_3", ColumnToModify = "AddedColumn_3"  });

using (var context = new EntityContext())
{
	// 1. Retrieve all existing data using the WhereBulkContains methods.
	var existingPickDatas = context.PickDatas.WhereBulkContains(importPickDatas, x => new
		{
			x.FacilityID,
			x.CustomerID,
			x.PickupDate,
			x.OrderID,
			x.ProductNumber,
			x.Location
		}  
	).AsNoTracking().ToList();
	
	FiddleHelper.WriteTable("2 - PickData found with WhereBulkContains:", existingPickDatas);

	// 2. Create a dictionary using a combination of columns to create a custom key.
	var existingPickDataDict = existingPickDatas.ToDictionary(x => GetCustomPickDataKey(x));
	
	// 3. Find all entities to be inserted using the dictionary.
	var toInsertPickDatas = importPickDatas.Where(x => !existingPickDataDict.ContainsKey(GetCustomPickDataKey(x))).ToList();
	
	// 4. Find all entities to be updated using the dictionary (or excluding inserted entities from the original list).
	var toUpdatePickDatas = importPickDatas.Where(x => existingPickDataDict.ContainsKey(GetCustomPickDataKey(x))).ToList();

	// 5. Insert data with BulkInsert.
	{
		 context.BulkInsert(toInsertPickDatas);
	}

	// UPDATE existing PickData
	{
		// 6. Copy the `id` retrieved previously for the entity to update.
		toUpdatePickDatas.ForEach(x => x.PickDataID = existingPickDataDict[GetCustomPickDataKey(x)].PickDataID);
		
		// 7. Update data with BulkUpdate using the natural primary key instead of a custom key.
		context.BulkUpdate(toUpdatePickDatas);
	}
	
	FiddleHelper.WriteTable("3 - PickDatas After:", context.PickDatas.AsNoTracking().ToList());
}
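
The GetCustomPickDataKey helper used above is not shown in the snippet; it simply builds the composite key from the columns of the custom key. A minimal sketch, assuming a string-based key, could look like this (the separator is arbitrary, as long as it cannot appear inside the values):

private static string GetCustomPickDataKey(PickData x)
{
	// Combine the custom key columns into a single string key.
	return $"{x.FacilityID};{x.CustomerID};{x.PickupDate:O};{x.OrderID};{x.ProductNumber};{x.Location}";
}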


There is no batching when using the WhereBulkContains method; all data is retrieved at once. This means that even the covering index from solution 1 became optional, as the index was only needed once instead of over 100 times.

Solution 3 - Sync Properties from Database Values

A new solution is now available when using Entity Framework Plus. The SyncPropertiesFromDatabaseValues feature lets you synchronize one or more properties of the entities in your list by retrieving the database values.

It's very similar to solution #2, but much easier to implement (a rough sketch follows the steps below):

  1. Sync ids using the SyncPropertiesFromDatabaseValues method.
  2. Find all entities to be inserted (id = 0).
  3. Find all entities to be updated (id != 0).
  4. Insert data with BulkInsert.
  5. Copy the id retrieved previously for the entity to update.
  6. Update data with BulkUpdate using the natural primary key instead of a custom key.
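
For illustration only, here is a rough sketch of those steps. The exact options accepted by SyncPropertiesFromDatabaseValues are not shown in this article, so the key expression option below is an assumption; refer to the Entity Framework Plus documentation for the actual API:

using (var context = new EntityContext())
{
	// 1. Sync ids from the database values.
	//    NOTE: the option name used here is an assumption; check the
	//    SyncPropertiesFromDatabaseValues documentation for the exact API.
	context.SyncPropertiesFromDatabaseValues(importPickDatas, options => {
		options.ColumnPrimaryKeyExpression = x => new
		{
			x.FacilityID,
			x.CustomerID,
			x.PickupDate,
			x.OrderID,
			x.ProductNumber,
			x.Location
		};
	});

	// 2. Entities that did not receive an id do not exist yet (id = 0).
	var toInsertPickDatas = importPickDatas.Where(x => x.PickDataID == 0).ToList();

	// 3. Entities that received an id already exist (id != 0).
	var toUpdatePickDatas = importPickDatas.Where(x => x.PickDataID != 0).ToList();

	// 4. Insert data with BulkInsert.
	context.BulkInsert(toInsertPickDatas);

	// 5.-6. The ids were already synced in step 1, so we can update directly
	//       using the natural primary key.
	context.BulkUpdate(toUpdatePickDatas);
}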

Conclusion

Our client finally ended up implementing only solution #1, as the covering index was fast enough for him. He told us he was pleased: the execution time improved from more than 30 minutes to less than 5 minutes.


Last updated: 2023-09-27