SQL Server Agent is a great tool for automation; however, there are several points I would recommend as best practice (though best practice is, of course, very much a subjective opinion!).
Creating operators and configuring notifications against each job
Unreported job failures can result in SSIS packages not running, data not being updated and issues occurring on the server without anyone being aware. These generally only get picked up when a user reports out-of-date data or other performance or functionality issues. This is especially problematic when the job that fails is the backup job.
The solution is to create operators (those who will receive email alerts of job failures) and configure notifications against jobs, to ensure that emails are sent out when jobs fail.
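This can be scripted rather than clicked through. A minimal sketch, in which the operator name, email address and job name are placeholders to replace with your own:

```sql
USE msdb;
GO

-- Create an operator to receive failure emails.
EXEC dbo.sp_add_operator
    @name = N'DataEngineering',
    @enabled = 1,
    @email_address = N'data.engineering@example.com';

-- Configure an existing job to email that operator on failure.
-- @notify_level_email: 0 = never, 1 = on success, 2 = on failure, 3 = always.
EXEC dbo.sp_update_job
    @job_name = N'Nightly ETL Load',
    @notify_level_email = 2,
    @notify_email_operator_name = N'DataEngineering';
```

Note that Database Mail must be enabled, and a mail profile assigned to SQL Server Agent, before any of these emails will actually be delivered.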
Monitoring jobs for individual step failures
When configuring SQL Server Agent jobs, it's fairly common to set steps to jump to the next step even on failure. This means that, even with the notifications above configured, an individual step failure won't be reported (unless it's the final step).
The solution I have found is to produce a simple stored procedure that checks for failed steps and sends an email notifying data engineers of them. This only requires a few components: a query over the job history tables, a stored procedure wrapping it, and an Agent job to run it on a schedule.
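A sketch of such a procedure is below. The procedure name, mail profile and recipient are assumptions to adapt; `STRING_AGG` requires SQL Server 2017 or later.

```sql
USE msdb;
GO

CREATE OR ALTER PROCEDURE dbo.usp_ReportFailedJobSteps  -- hypothetical name
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @body NVARCHAR(MAX);

    -- run_status 0 = failed. step_id 0 is the whole-job outcome row,
    -- so it is excluded to pick up only individual step failures.
    SELECT @body = STRING_AGG(
        CONCAT(j.name, N' / step ', h.step_id, N' (', h.step_name, N'): ',
               LEFT(h.message, 200)), CHAR(13) + CHAR(10))
    FROM dbo.sysjobhistory AS h
    JOIN dbo.sysjobs AS j ON j.job_id = h.job_id
    WHERE h.run_status = 0
      AND h.step_id > 0
      -- run_date is an integer in yyyymmdd form; look back one day.
      AND h.run_date >= CONVERT(INT,
            CONVERT(CHAR(8), DATEADD(DAY, -1, GETDATE()), 112));

    IF @body IS NOT NULL
        EXEC dbo.sp_send_dbmail
            @profile_name = N'AgentMail',   -- assumed Database Mail profile
            @recipients   = N'data.engineering@example.com',
            @subject      = N'Failed SQL Agent job steps',
            @body         = @body;
END;
```

Scheduling this in its own Agent job (say, daily) closes the gap left by steps that continue on failure.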
Monitoring overrunning jobs
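One approach is to compare each currently running job's elapsed time against its historical average duration. A sketch, assuming a threshold of twice the average (in practice you would also restrict `dbo.sysjobactivity` to the most recent Agent session via `dbo.syssessions`):

```sql
USE msdb;
GO

WITH avg_duration AS (
    -- run_duration is stored as an integer in HHMMSS form,
    -- hence the conversion to seconds.
    SELECT job_id,
           AVG((run_duration / 10000) * 3600
             + (run_duration / 100 % 100) * 60
             + (run_duration % 100)) AS avg_seconds
    FROM dbo.sysjobhistory
    WHERE step_id = 0 AND run_status = 1   -- successful whole-job outcomes
    GROUP BY job_id
)
SELECT j.name,
       a.start_execution_date,
       DATEDIFF(SECOND, a.start_execution_date, GETDATE()) AS running_seconds,
       d.avg_seconds
FROM dbo.sysjobactivity AS a
JOIN dbo.sysjobs        AS j ON j.job_id = a.job_id
JOIN avg_duration       AS d ON d.job_id = a.job_id
WHERE a.start_execution_date IS NOT NULL
  AND a.stop_execution_date IS NULL        -- still running
  AND DATEDIFF(SECOND, a.start_execution_date, GETDATE()) > 2 * d.avg_seconds;
```

Running this from a scheduled job and feeding any rows into the same email mechanism as the step-failure check gives an early warning before an overrun becomes a missed deadline.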
Categorising jobs
SQL Server jobs have the option under the General tab to categorise the job. This is useful as it allows jobs to be classified by their broad purpose, the team that looks after them and/or the area they serve. Categories can then be used to separate out performance reporting of jobs and to support more sophisticated distribution of job errors/failures. To support this, a control to identify uncategorised jobs is also useful, so that if categories are relied upon you can be confident every job actually has one.
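Both pieces can be scripted. A sketch, with the category and job names as placeholders:

```sql
USE msdb;
GO

-- Create a category, then assign a job to it.
EXEC dbo.sp_add_category
    @class = N'JOB', @type = N'LOCAL', @name = N'Finance ETL';

EXEC dbo.sp_update_job
    @job_name = N'Nightly ETL Load',
    @category_name = N'Finance ETL';

-- Control query: jobs still sitting in the default category.
SELECT j.name
FROM dbo.sysjobs AS j
JOIN dbo.syscategories AS c ON c.category_id = j.category_id
WHERE c.name = N'[Uncategorized (Local)]';
```

The control query can be run on a schedule and any rows emailed out, in the same way as the failed-step check.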
Consider analysing job performance
SQL Server Agent writes all of its activity to system tables in msdb. This can be drawn out using SQL and then analysed in Excel, SSRS or Power BI, with Power BI being the more elegant solution. This allows analysis of overall job performance, the propensity of jobs to fail, and a breakdown of job performance by its component steps.
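An example extract suitable for loading into such a tool, using msdb's long-standing (though undocumented) `dbo.agent_datetime` helper to combine the integer date and time columns:

```sql
USE msdb;
GO

SELECT j.name                                           AS job_name,
       h.step_id,
       h.step_name,
       msdb.dbo.agent_datetime(h.run_date, h.run_time)  AS run_datetime,
       -- run_duration is an integer in HHMMSS form; convert to seconds.
       (h.run_duration / 10000) * 3600
     + (h.run_duration / 100 % 100) * 60
     + (h.run_duration % 100)                           AS duration_seconds,
       CASE h.run_status
            WHEN 0 THEN 'Failed'    WHEN 1 THEN 'Succeeded'
            WHEN 2 THEN 'Retry'     WHEN 3 THEN 'Cancelled'
       END                                              AS outcome
FROM dbo.sysjobhistory AS h
JOIN dbo.sysjobs       AS j ON j.job_id = h.job_id
ORDER BY run_datetime DESC;
```

Rows with `step_id = 0` are whole-job outcomes and the rest are individual steps, so the one extract supports both overall and per-step analysis.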
The only caveat is that by default SQL Server Agent only holds logs for a limited period: the default history is 1,000 rows in total (and 100 rows per job), and the length of time that covers depends on how often and how many jobs are run. The best solution I have found is to create a job that archives off job history, so that a longer history is retained.
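A sketch of such an archiving step, copying any history rows not yet archived into a user table before the rolling purge discards them. The archive table name and column subset are assumptions; `instance_id` is the history table's unique key, which makes the copy idempotent:

```sql
USE msdb;
GO

IF OBJECT_ID(N'dbo.sysjobhistory_archive') IS NULL
    CREATE TABLE dbo.sysjobhistory_archive
    (
        instance_id  INT              NOT NULL PRIMARY KEY,
        job_id       UNIQUEIDENTIFIER NOT NULL,
        step_id      INT              NOT NULL,
        step_name    SYSNAME          NOT NULL,
        message      NVARCHAR(4000)   NULL,
        run_status   INT              NOT NULL,
        run_date     INT              NOT NULL,
        run_time     INT              NOT NULL,
        run_duration INT              NOT NULL
    );
GO

-- Copy only rows we have not archived yet.
INSERT INTO dbo.sysjobhistory_archive
SELECT h.instance_id, h.job_id, h.step_id, h.step_name, h.message,
       h.run_status, h.run_date, h.run_time, h.run_duration
FROM dbo.sysjobhistory AS h
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.sysjobhistory_archive AS a
                  WHERE a.instance_id = h.instance_id);
```

Scheduling this more frequently than the history rolls over (e.g. daily) means the analysis above can run against the archive table with no practical limit on history length.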