Steve HowardEliminating Deadlocks Caused By Foreign Keys with Large Transactions
Validating foreign keys requires additional work for data modification operations and can lead to deadlocks in certain scenarios. There are, however, ways you can work around this contention and eliminate the possibility of deadlocks while checking referential integrity.
In this technical note, we describe a recent test lab in which we were able to observe how SQL Server behaves at scale with tables that use foreign keys. We discuss what we learned about how SQL Server locking and access strategies can change as the data size changes and we describe the effects these changes can have. We also describe how the deadlocking was resolved in our case by preventing different transactions from requesting locks on the same rows or pages, forcing a direct access of the rows and forcing all locks to be taken at the lowest possible granularity. This allowed for very high levels of concurrent imports on tables that use foreign keys.
In a typical parent-child modification operation, one table is modified before the related table or tables are modified. In the case of inserts, the parent table is modified first; when the child table receives the corresponding insert, a verification step is initiated. In the case of deletes, the child table is removed first; when the parent table is deleted, a verification step is carried out to ensure that no child records exist for the parent record. It is in the verification steps that deadlocking can occur.
When Microsoft SQL Server operates with large INSERT … SELECT combo statements or with large DELETE statements, SQL Server might choose verification methods that increase the chances of deadlocking. Disabling page locks and lock escalation on certain objects involved in these transactions and specifying OPTION (LOOP JOIN) might be necessary to prevent deadlocking in systems that have multiple concurrent large transactions on tables that use foreign keys.
In a recent test lab, we were able to observe how SQL Server behaves at scale with tables that use foreign keys and we tested these locking and access strategies that can prevent deadlocking. In our case, 24 or 48 processes imported data concurrently. Each of these processes performed a bulk insert into a local temporary table where the data was scrubbed and converted. Each process then used an INSERT … SELECT combo statement to move the scrubbed data into the permanent tables.
Our system performed well with low volumes of data in the tables and with low volumes of data in the import transactions. However, severe deadlocking surfaced that effectively prevented concurrent imports as the tables grew and as the size of the data imported from the temporary staging tables expanded. We observed that this deadlocking occurred even though different import processes dealt strictly with different sets of data that were loaded into the same tables.
To resolve deadlocking, we typically resulted had to rollback most of the concurrent transactions, a major setback when processing these large amounts of data. Moreover, retrying the rolled-back transactions often resulted in additional deadlocks.
Validation of Foreign Keys
In our benchmark, foreign keys were a critical part of the database design. Dropping foreign keys was not considered a viable option. Everyone understood that foreign keys would require additional work for data manipulation operations, but this was deemed to be an equitable tradeoff for maintaining the data integrity, regardless of the source of the data manipulation operation.
SQL Server places the validation of foreign keys into the query plan after the data is inserted into the target table. You see the validation of foreign keys as a join to the table against which the data must be validated.
Figure 1 shows a query plan with foreign key validation.
Figure 1: Execution plan showing SQL Server using a loop join to verify referential integrity
In Figure 1, an INSERT … SELECT statement is used to insert data into dbo.referringTable from #detailData. The table dbo.referringTable has a foreign key that references dbo.HeaderTable.
Starting from the upper right, you can see the flow of the data as SQL Server first scans the #detailData table for the data to be inserted and then inserts the data into dbo.referringTable. After inserting the data into the base table, SQL Server sorts the data so that the rows can also be inserted into the non-clustered index ncl_referringTable. This sort makes the insert into that non-clustered index most efficient, but note that an estimated 42% of the total query cost is spent on sorting the data in the order of the non-clustered index.
After the data is inserted into dbo.referringTable and all of its associated indexes, SQL Server verifies that referential integrity is maintained by using a loop join operation to perform a left semi-join to dbo.HeaderTable. Note that the row count spool operation that is used to support this join accounts for an estimated 40% of the total query cost. More importantly, the verification does not take place until after the data is inserted into the target table (dbo.referringTable, in this case). It is only at this point that SQL Server can verify whether the data inserted into the target table violates the foreign key constraint, so it is at this time that SQL Server determines whether the transaction can continue or if it must be rolled back.
Note SQL Server acquires shared locks when validating foreign keys, even if the transaction is using read committed snapshot (read committed using row versioning) or snapshot isolation level. Be mindful of this when examining deadlock graphs from transactions when these transaction isolation levels are used. If you see shared locks, check to see whether the locks are taken on an object that is referenced by a foreign key.
Contention Issues as Transactions Grow
As the amount of data grows, the query plan is likely to change, typically when statistics get updated (automatically or manually) or when the SQL Server Query engine determines that a significant change in table size necessitates a different query plan. Additionally, when an index is created, the statistics are updated with fullscan, also causing a potential change in the query plan.
For large datasets, when the clustered index is added (whether the index is added before or after the data is inserted) can determine whether SQL Server uses a nested loop join or a merge join during the foreign key validation.
In the query plan shown in Figure 1, the temporary table from which the data to be inserted is selected has no clustered index key. The statistics have probably been updated, but only by auto-create and auto-update statistics. This means that the table might contain significantly more data than the statistics indicate. Therefore, we added a clustered index to the #detailData temporary table after data was added in an effort to improve join efficiencies (resulting in up-to-date statistics on #detailData’s new index).
Figure 2 shows the query plan for the same INSERT … SELECT statement.
Figure 2: Execution plan showing SQL Server using a merge join to validate referential integrity
Figure 2 and Figure 1 show query plans for the same query, but because SQL Server recognizes that a larger dataset is being inserted in Figure 2, SQL Server replaces the loop join to validate referential integrity in Figure 1 with a merge join in Figure 2. This change eliminates the row count spool operation, but the most significant effect it has, from a concurrency standpoint, is that the clustered index seek on HeaderTable.PK_HeaderTable is replaced by a clustered index scan on HeaderTable.PK_HeaderTable.
Understanding the potential deadlocking
The new query plan in Figure 2 sets up potential deadlocks in transactions. Consider the following progression with two transactions, T1 and T2:
1. T1 opens a transaction and inserts a few rows into HeaderTable. Because only a few rows are inserted, an intent exclusive (IX) lock is taken on HeaderTable and on the pages into which the rows will be inserted. Exclusive key (X) locks are taken on the rows that are actually inserted.
2. T2 opens a transaction and inserts a few rows into HeaderTable. Because only a few rows are inserted, an IX lock is taken on HeaderTable and on the pages into which the rows will be inserted. Exclusive key locks are taken on the rows that are actually inserted. No blocking has occurred at this point.
3. T1 begins inserting data into ReferringTable. Locks are taken as appropriate on this table.
4. T2 begins inserting data into ReferringTable. Locks are taken as appropriate on this table. Lock escalation is prevented because multiple transactions concurrently hold locks on this table.
5. When T1 data is inserted into ReferringTable, T1 begins the scan of HeaderTable.PK_HeaderTable to verify referential integrity. Before the scan is complete, T1 is blocked in its attempt to read the rows that have been inserted into HeaderTable by T2.
6. When T2 data is inserted into ReferringTable, T2 begins the scan of HeaderTable.PK_HeaderTable to verify referential integrity. Before the scan is complete, T2 is blocked in its attempt to read the rows that have been inserted into HeaderTable by T1. This is now a deadlock.
Blocking occurs on HeaderTable when referential integrity is validated because shared locks are taken for this validation, even if the transaction itself is using read committed snapshot isolation or snapshot isolation levels.
We also encountered deadlocking because, in some instances, a row locking strategy was chosen for inserting rows, but a page locking strategy was chosen for validation of referential integrity. In this case, T1 and T2 could deadlock when trying to validate referential integrity because a shared lock on the page was requested by one transaction to validate the referential integrity, but the other transaction had an IX lock on the page because of X locks on rows inserted into that page.
Eliminating the Contention
To work around the contention, it is necessary to prevent different transactions from requesting locks on the same rows or pages. The general idea is to force a direct access of the rows and to force all locks to be taken at the lowest possible granularity.
In our benchmark, we used the following steps:
1. Ensure that different transactions process different data.
2. Disable lock escalation on HeaderTable.
3. Disable page locking on indexes on HeaderTable.
4. Hint the INSERT … SELECT query with OPTION (LOOP JOIN).
Ensure that different transactions process different data
In the case of our benchmark, there were 24 or 48 processes running concurrently to import data, and each processed a different dataset. For inserts, data must first be inserted into HeaderTable and then the details that reference those header records are inserted into detailTable. Mutually exclusive datasets were an integral part of the process; if the datasets were not mutually exclusive, there would have been primary key violations when inserting data into the HeaderTable. Note that it might be possible for multiple processes that all refer to the same row in the HeaderTable to insert data into detailTable; however, the processes must address the situation to prevent accessing the same rows.
Disable lock escalation on HeaderTable
If locks are allowed to escalate on HeaderTable, different types of deadlocking might occur. In our case, lock escalation on HeaderTable would result in severe contention, preventing the required level of concurrency. Because we also planned to disable page locking, we would have forced inserts into HeaderTable to take row locks, and single statements could reach the escalation threshold much faster than if the statement were able to take page locks. For the sake of concurrency, therefore, we disabled lock escalation.
Lock escalation can be enabled or disabled on a table by using the ALTER TABLE syntax. To disable lock escalation on HeaderTable, execute the following query:
ALTER TABLE HeaderTable SET (LOCK_ESCALATION = DISABLE)
Note Disabling lock escalation is not appropriate for all scenarios; in some cases, disabling lock escalation can cause out-of-memory errors in SQL Server. We recommend disabling lock escalation only for dealing with specific contention issues and only after extensive testing. In some cases, lock escalation can be disabled temporarily for processes that are known to have contention issues.
Note that you can determine whether lock escalation is allowed or disabled on a table by querying sys.tables.
Disable page locking on indexes on HeaderTable
You can disable page locking on indexes to keep different granularities of locks from causing deadlocks. Disabling page locking on indexes eliminates the possibility of contention that is caused by the inserts holding key locks (which implies IX locks are held on the containing pages) and by the validation of referential integrity taking shared page locks. If page locking is disaabled, the process of checking foreign keys takes row locks, even if a scan is performed.
To disable page locking on HeaderTable.PK_HeaderTable, you can execute the following query:
ALTER INDEX PK_HeaderTable ON HeaderTable SET (ALLOW_PAGE_LOCKS = OFF)
Note Disabling page locking is not appropriate for all scenarios; in some cases, disabling page locking can cause additional memory contention or can lock escalation, which can increase lock contention. We recommend disabling page locking on an index only for dealing with specific contention issues and only after extensive testing.
Note that you can determine whether page locking and row locking are enabled by querying sys.indexes.
Hint the INSERT … SELECT query with OPTION (LOOP JOIN)
If a scan is performed to support a merge or hash join, every row in that index or table must be read. If another transaction holds a lock on a row, the scan cannot complete until the lock on that row is released. This sets up the possibility of deadlock. To eliminate this possibility, you can force an access strategy that avoids touching rows locked by other transactions. We can use the OPTION (LOOP JOIN) when we know that the query must validate foreign keys to prevent the deadlock.
If an index exists to support a seek on the inner table, SQL Server can navigate from the root of that index directly to the leaf page where the row that needs to be accessed exists when a loop join is used. No other rows need be checked. This eliminates the possibility of a query being blocked by locks on rows in the table that are locked by other transactions. (We eliminate the scan by hinting for a loop join to be used, and this prevents the blocking and deadlocking.)
With normal joins, for example, you can use the INNER LOOP JOIN or the LEFT OUTER LOOP JOIN syntax in the FROM clause of the query to specify a join type. However, the table that is referenced by a foreign key is not explicitly stated in the query. Because SQL Server uses a join to validate foreign keys, you can still bring about the chosen join strategy by using the OPTION clause in the query.
In the query plans shown in Figures 1 and 2, the query text was as follows:
INSERT INTO referringTable SELECT * FROM #detailData
This query allowed SQL Server to choose the join type that was used to validate the referential integrity. In the two query plans, SQL Server chose two different join strategies. To force SQL Server to use a loop join to validate the referential integrity, the query should read as follows:
INSERT INTO referringTable SELECT * FROM #detailData
OPTION (LOOP JOIN)
When looking at the resulting query plan, you will notice that the overall estimated subtree cost in the execution plan is higher with the OPTION (LOOP JOIN) hint than it was when SQL Server chose the scan and merge join. This cost difference is the reason SQL Server chose the scan and merge join. However, in our case, the contention prevented concurrency and therefore limited the overall throughput. It was worthwhile to choose a query that had a higher estimated cost to gain the concurrency and to allow for a higher level of overall throughput. You can determine whether this is true in your environment only by testing.
Challenges to using the OPTION (LOOP JOIN) hint
Using the OPTION (LOOP JOIN) hint causes SQL Server to use a loop join to perform all joins in the query, not just joins that are used to validate foreign keys. If SQL Server previously found another join strategy to be more efficient on other joins, it will now use a loop join operator to perform those joins as well.
In some cases, this can cause a cost explosion. You cannot hint those other joins specifically to use a different join type, because this will result in an error informing you that you have conflicting join hints. If you must use the OPTION (LOOP JOIN) hint, consider the following tips:
Look at the execution plan on the other joins. Ensure that the inner table (the table that appears on the bottom on the execution plan) has an index to support the loop join without scanning. For example, consider the following query:
SELECT l.col1, l.col2, r.col3, r.col4
FROM leftTable l
JOIN rightTable r ON l.col1 = r.col1 AND l.col2 = r.col2
OPTION (LOOP JOIN)
If the execution plan shows that rightTable is the inner table, ensure that rightTable has an index on (col1, col2) or on (col2, col1), whichever is appropriate in the situation, to avoid scanning on the inner table. A loop join that has a scan on the inner table is very inefficient and can result in high CPU utilization or other resource contention.
Loop joins for which the seek on the inner table results in a range scan can also be very inefficient and might result in high CPU utilization or other resource contention. Try to avoid situations in which many rows can be returned from the inner table on any one join value combination.
As with any possible solution, hinting a loop join results in a tradeoff. In this case, we are trading available CPU resources for better concurrency. Be sure to test this solution thoroughly in your environment to verify that the tradeoff is worthwhile.
Using foreign keys results in additional work for data manipulation operations in which referential integrity is enforced. The validation of the referential integrity is verified with a join. In cases in which the data being inserted is small enough, SQL Server defaults to the use of a loop join. In cases in which the dataset being inserted is known to be large, SQL Server defaults to the use of another join strategy that necessitates a scan of the referenced table. Scanning can result in deadlocking with concurrent insert operations.
In some cases, transactions may use a key locking strategy when inserting into the referenced table and then use a page locking strategy when verifying referential integrity. This different granularity can also result in deadlocking.
SQL Server always uses shared locks when validating referential integrity. This is true even if the transactions are using read committed snapshot (read committed using row versioning) or snapshot isolation levels.
To eliminate the deadlocking when checking referential integrity, you can disable lock escalation on the referenced table, disable page locking on the referenced table, and hint OPTION (LOOP JOIN) on transactions that operate on mutually exclusive datasets. This forces SQL Server to lock the minimal amount of data and to use a more direct seek operation to access the referenced rows while checking referential integrity.