Here is a DBA technical challenge problem I solved that got me through the first stage of the hiring process during a DBA job interview a few years ago. It is on SQL Server performance troubleshooting and tuning.
The question goes like this: You were recently hired at a company as the only SQL Server 2017 DBA. You had just been notified by the management that one of the SQL Server cluster environments has been performing very badly for about a week now and that immediate analysis should be conducted.
You are not yet familiar with the applications or the databases yet, but they told you to find the problem as soon as possible. How will you go about fixing this problem?
The question also requested for details on the following points:
Then it asked to explain things in a high level along with assumptions and recommendations.
Here is my answer:
Assuming the clusters are running on the same server, I will connect to the server and analyze the performance dashboard. I will use a combination of Activity Monitor, Database Engine Tuning Advisor, Report Performance Dashboard, and T-SQL DMVs for a more detailed analysis.
I will first start with the most common symptoms of a poorly performing database. These are slow-running queries and stored procedures, CPU, I/O, network, and memory bottlenecks. Specifically the first place to start analyzing is the waits.
If I see CXPACKET wait of more than 5% of all resource waits, then I would assume that there is a problem of parallelism bottleneck or slow-running queries. To identify and isolate slow-running queries that are as a result of table/index scans due to missing indexes or non-sargable predicates - predicates that perform table scans instead of index seeks - , I will look for queries with both CXPACKET and PAGEIOLATCH_XX waits.
Because I am not currently familiar with the databases, and therefore cannot go in to start refactoring the affected queries and stored procedures, I would recommend limiting parallelism for the entire instance by setting max degree of parallelism to a number that is less or equal to the total number of CPUs. For instance if I see that we have 5 processors, then I would recommend setting max degree of parallelism to <=5.
Report performance dashboard limits the number queries under review to 20. It also does not show query execution plans for the listed queries. Consequently I would use T-SQL queries with DMVs - Dynamic Management Views - below for more in-depth analysis. Performance dashboard does not associate waits to specific queries, however it gives me a quick look at where most of the waits might be occurring. With that I would be more specific in analyzing the wait types returned by the TSQL queries.
For queries with high logical reads and writes along with their execution plans I will use the below query:
SELECT TOP 50 SUBSTRING(QTEXT.TEXT, (QSTATS.statement_start_offset/2)+1, ((CASE QSTATS.statement_end_offset WHEN -1 THEN DATALENGTH(QTEXT.TEXT) ELSE QSTATS.statement_end_offset END - QSTATS.statement_start_offset)/2)+1) AS query, QSTATS.execution_count, QSTATS.last_execution_time, QSTATS.total_logical_reads, QSTATS.total_logical_writes, QSTATS.total_worker_time, QSTATS.total_elapsed_time/1000000 AS total_elapsedtime_insecs, QSTATS.last_elapsed_time/1000000 AS last_elapsedtime_insecs, QPLAN.query_plan FROM sys.dm_exec_query_stats QSTATS CROSS APPLY sys.dm_exec_sql_text(QSTATS.sql_handle) QTEXT CROSS APPLY sys.dm_exec_query_plan(QSTATS.plan_handle) QPLAN ORDER BY QSTATS.total_logical_reads DESC, QSTATS.last_execution_time DESC
Ordering the records by the last execution time will let me analyze executions for the past one week since that is when this problem was first observed.
To see slow-running queries with wait types and execution plans, I will use the following query. To see use the graphical interface version of the plans instead of xml, I would open each xml and save with
.sqlplan extension and then open it in SSMS for a more visual analysis.
SELECT QTEXT.TEXT AS query, WTASK.[wait_type], WTASK.[wait_duration_ms], WTASK.[blocking_session_id], WTASK.[resource_description], REQ.[database_id], QPLAN.[query_plan], REQ.[cpu_time] FROM sys.dm_os_waiting_tasks WTASK INNER JOIN sys.dm_exec_sessions SESS ON WTASK.[session_id] = SESS.[session_id] INNER JOIN sys.dm_exec_requests REQ ON SESS.[session_id] = REQ.[session_id] OUTER APPLY sys.dm_exec_sql_text (REQ.[sql_handle]) QTEXT OUTER APPLY sys.dm_exec_query_plan (REQ.[plan_handle]) QPLAN
Before analyzing the identified queries, I will clear database cache so as to get a more accurate performance of the queries. To do this I will run the following query in SSMS window.
DBCC dropcleanbuffers; DBCC freesystemcache('ALL'); GO
If I am unable to identify the issue using the above methods, I would use activity monitor to see and analyze current running queries to isolate the most expensive ones. I can right-click on the queries returned by the activity monitor to show their execution plans.
After analyzing identified queries with potential issues and found no issues, and if the databases? resource usage is within normal range, I would look at the available and used RAM and Hard Disk to see if hardware scaling could be recommended.
First I will connect to the server and open the performance dashboard under reports from the object explorer. The dashboard shows a summary of CPU utilization for both system and SQL sources as well as historical information section with links to more drill-down reports on waits, I/O statistics and expensive queries by CPU and logical reads.
I will analyze the drill-down reports of slow-running queries by duration, CPU, logical reads, and logical writes. I will look at each query's cumulative CPU by average, minimum, and maximum values. I will also look at the cumulative physical and logical reads and writes.
If I do not find any issues with logical reads and writes, I would look at the DMV query results for wait types of CXPACKET and PAGEIOLATCH_EX with high waiting duration values which may indicate a problem with query predicates (WHERE clauses) or slow-running queries. Additionally I will look at high NETWORK_IO wait types such as ASYNC_NETWORK_IO to see if there are applications that are taking long time to process database results. If there are none, then I would continue with other possible internal causes such as badly designed database schema and indexes.
I will check the predicates of the returned queries with potential issues for heavy use of functions that could be non-sargable. I will also check queries with missing or inefficient indexes. Additionally I will look at query execution plans returned by the top DMV queries above. I will specifically look for table scans and index spools which may show that SQL Server is creating indexes on the fly instead of using predefined indexes. I will look for warning messages on the plan as well.
More from @sqldibia