We have all been there: you are working away happily and then you hear the unmistakable tick, tick, tick of a failing hard disk. Of course, SSDs are commonplace today, so a lot of you might be wondering what I am on about, but ask anyone who has been around IT long enough and they will tell you stories about that impending sound of doom when a hard disk starts to fail.
The trouble, of course, is how to monitor for signs of disk failure and, more importantly, how to be proactive about replacing hardware before it ends up on the scrap heap.
Monitoring Hard Disk Health
Modern machines give us several ways to monitor hard disk health, from built-in SMART (Self-Monitoring, Analysis, and Reporting Technology) errors, to read error counts, through to wear values, with most values readily available to query via WMI. Let’s take a look at each of these in isolation first.
SMART
The Self-Monitoring, Analysis, and Reporting Technology feature built into most hard disks today is a self-monitoring system that allows the disk to report predicted and detected anomalies during normal operation. When an anomaly is detected, a corresponding error code is set, which software can then read to warn you that something isn’t quite right with your hard disk.
Below is a sample list of failure attributes (source: S.M.A.R.T. – Wikipedia)
We can query the resulting failure prediction using the MSStorageDriver_FailurePredictStatus class in the root\WMI namespace:
Get-WmiObject -Namespace root\wmi -Class MSStorageDriver_FailurePredictStatus
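Get-WmiObject is not available in PowerShell 7, where Get-CimInstance can read the same class. The sketch below is one possible way to keep only the drives that predict a failure; Select-PredictedFailure is a hypothetical helper name introduced for illustration, not part of the original scripts.

```powershell
# Hypothetical helper (illustration only): passes through only those
# status entries where the drive itself predicts a failure.
function Select-PredictedFailure {
    param (
        [Parameter(ValueFromPipeline = $true)]
        [object]$Status
    )
    process {
        if ($Status.PredictFailure -eq $true) {
            # Emit a trimmed-down object with the fields of interest
            [PSCustomObject]@{
                InstanceName   = $Status.InstanceName
                PredictFailure = $Status.PredictFailure
                Reason         = $Status.Reason
            }
        }
    }
}

# Usage on a Windows device (elevated session assumed):
# Get-CimInstance -Namespace root\wmi -ClassName MSStorageDriver_FailurePredictStatus -ErrorAction SilentlyContinue |
#     Select-PredictedFailure
```

An empty result simply means no drive is currently reporting a predicted failure, or that the class is not exposed by the storage driver on that device.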
Drive Health Status
Windows also maintains its own health state for storage, which can be queried using the Get-PhysicalDisk cmdlet in PowerShell:
Get-PhysicalDisk
Using this we have a quick and easy means of obtaining a health value for the disk.
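As a rough illustration of using that health value, the sketch below filters a set of disks down to those not reporting "Healthy". Get-UnhealthyDisk is a hypothetical helper name, and it assumes the objects carry the FriendlyName, MediaType, HealthStatus and OperationalStatus properties that Get-PhysicalDisk returns.

```powershell
# Hypothetical helper (illustration only): returns the disks whose
# HealthStatus is anything other than "Healthy".
function Get-UnhealthyDisk {
    param (
        [object[]]$Disk
    )
    foreach ($Item in $Disk) {
        if ($Item.HealthStatus -ne 'Healthy') {
            $Item | Select-Object -Property FriendlyName, MediaType, HealthStatus, OperationalStatus
        }
    }
}

# Usage on a Windows device:
# Get-UnhealthyDisk -Disk (Get-PhysicalDisk)
```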
Read / Write Error Count
Read and write error counts often signify poor cells or sectors on the disk, and some OEMs accept them as grounds for replacing a hard disk. In the detection script later in this post, the values are captured and flagged once they go beyond a predefined maximum set within the script.
Temperature
As with all electronics, temperature plays a vital part in the longevity of the component. This is why server rooms, for example, are kept at low temperatures, so that the hardware lasts as long as possible before failure. The same is true of your laptop or desktop drive, and again we can monitor this where the drive reports temperature values.
Get-PhysicalDisk | Get-StorageReliabilityCounter | Select-Object -Property DeviceID, Wear, ReadErrorsTotal, ReadErrorsCorrected, WriteErrorsTotal, WriteErrorsUncorrected, Temperature, TemperatureMax | FT
In researching this I found that not all disks report their manufacturer’s maximum temperature, so determining whether a drive is out of spec can’t always be relied upon, although you could of course determine an average across your devices and set that as your lowest common warning state.
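One way to read the "average across your devices" idea is sketched below: take temperature readings collected from your fleet, ignore the drives that report nothing, and add a safety margin to the average. Get-TemperatureWarningThreshold, the 10 degree default margin and the treatment of 0 as "not reported" are all assumptions made for this example rather than values from the original scripts.

```powershell
# Hypothetical helper (illustration only): derives a warning threshold
# from temperatures observed across a set of devices.
function Get-TemperatureWarningThreshold {
    param (
        [Parameter(Mandatory = $true)]
        [int[]]$Temperature,

        # Assumed safety margin, in the same unit as the readings
        [int]$Margin = 10
    )
    # Assumption: a reading of 0 means the drive did not report a temperature
    $Valid = @($Temperature | Where-Object { $_ -gt 0 })
    if ($Valid.Count -eq 0) {
        return $null
    }
    $Average = ($Valid | Measure-Object -Average).Average
    return ([int][math]::Round($Average)) + $Margin
}

# Usage with readings gathered from several devices:
# Get-TemperatureWarningThreshold -Temperature 40, 42, 0, 44
```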
Read / Write Errors
As the previous screenshot shows, that PowerShell query returns information including read and write errors. Looking at this on a machine with issues, we can see the counters being incremented:
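To see whether the counters are still climbing rather than just high, two snapshots can be compared. The following is a minimal sketch under that assumption; Get-ErrorCounterDelta is a hypothetical helper name, and it expects two objects exposing the ReadErrorsTotal and WriteErrorsTotal properties returned by Get-StorageReliabilityCounter.

```powershell
# Hypothetical helper (illustration only): reports how far the error
# counters moved between two snapshots of the same disk.
function Get-ErrorCounterDelta {
    param (
        [Parameter(Mandatory = $true)]
        [object]$Previous,

        [Parameter(Mandatory = $true)]
        [object]$Current
    )
    # Missing counters cast to 0, so drives that report nothing yield a delta of 0
    [PSCustomObject]@{
        ReadErrorsDelta  = ([long]$Current.ReadErrorsTotal) - ([long]$Previous.ReadErrorsTotal)
        WriteErrorsDelta = ([long]$Current.WriteErrorsTotal) - ([long]$Previous.WriteErrorsTotal)
    }
}

# Usage on a Windows device, taking two snapshots some time apart:
# $Before = Get-PhysicalDisk | Select-Object -First 1 | Get-StorageReliabilityCounter
# $After  = Get-PhysicalDisk | Select-Object -First 1 | Get-StorageReliabilityCounter
# Get-ErrorCounterDelta -Previous $Before -Current $After
```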
Wear Value
There is another interesting value that we can obtain: the “wear” value. This value ranges from 0 to 100, where 100 indicates that the drive has reached the end of its usable life. The Microsoft definition of this value is stated below:
Wear
Data type: UInt8
Access type: Read-only
The storage device wear indicator, in percentage. At 100 percent, the estimated wear limit will have been reached.
Source – MSFT_StorageReliabilityCounter class | Microsoft Docs
Where SSDs are concerned, I am assuming (although this is not confirmed) that the wear value reflects how many spare cells are in use, as SSDs have built-in redundancy to provide the stated storage capacity for as long as possible, bringing spare cells into use as normal cells reach the end of their life.
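As a small sketch of how the wear percentage might be turned into a state, the function below maps a value to one of three labels. Get-WearState and its labels are hypothetical, and the default warning threshold of 90 is an assumption chosen to match the maximum wear value used in the detection script in this post.

```powershell
# Hypothetical helper (illustration only): classifies a wear percentage.
function Get-WearState {
    param (
        [Parameter(Mandatory = $true)]
        [int]$Wear,

        # Assumed default, matching the detection script's maximum wear value
        [int]$WarningThreshold = 90
    )
    if ($Wear -ge 100) {
        # Per the Microsoft definition, 100 percent means the estimated wear limit is reached
        'EndOfLife'
    }
    elseif ($Wear -ge $WarningThreshold) {
        'Warning'
    }
    else {
        'OK'
    }
}

# Usage on a Windows device:
# Get-PhysicalDisk | Get-StorageReliabilityCounter | ForEach-Object { Get-WearState -Wear ([int]$_.Wear) }
```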
Endpoint Analytics / Proactive Remediations FTW (Again)
Given that we have all of these values, I thought it would be useful to monitor them, or indeed to notify the end user of the impending doom that faces them. Let’s face it, no one likes a machine failing when they are trying to work, and with remote working now commonplace, obtaining a machine or disk replacement quickly could pose a challenge.
So first of all, let me state that the monitoring solution is provided as a “predictive” or “proactive” solution. After speaking with the main OEMs about this, typically drives will ONLY be replaced when they suffer a hard failure. With that said, if your machines are out of warranty, then this issue really doesn’t apply to you, as a replacement is going to cost money either way, and it’s up to you to weigh up the cost of potential productivity loss versus the cost of a replacement drive.
With that in mind then, we can put this all together in PowerShell, and start monitoring!
Detection Script
The first thing we have to do is build our detection script, which should take into account all of the above, using the drive health and SMART values first as the primary failure detection methods, and thereafter using error and wear states. Of course, the idea here is to “exit 1”, triggering the remediation script, should anything be abnormal in the disk health states.
<#
    .NOTES
    ===========================================================================
     Created on:    21/04/2021 11:00 AM
     Created by:    Maurice Daly
     Organization:  CloudWay
     Filename:      Invoke-DiskHealthCheck.ps1
    ===========================================================================
    .DESCRIPTION
    Monitors WMI values for hard disk health, helping you predict or detect anomalies and be proactive about hard disk replacement.
#>

#region ScriptVariables
# Define variables
$Organisation = "YOUR ORG DETAILS"
$MaxWearValue = 90
$MaxRWErrors = 100
$RegistryBase = "HKLM:\SOFTWARE\$Organisation\Monitoring\Disk Health"

# Obtain physical disk details
$Disks = Get-PhysicalDisk | Where-Object { $_.BusType -match "NVMe|SATA|SAS|ATAPI|RAID" }
#endregion ScriptVariables

#region ScriptFunctions
function Write-RegistryEntries {
    # Uses $Disk, $DiskHealth and the related variables set by the calling loop

    # Set disk registry path
    $RegistryPath = Join-Path -Path $RegistryBase -ChildPath "Disk $($Disk.DeviceID)"

    # Create disk registry key if not present
    if (-not (Test-Path -Path $RegistryPath)) {
        New-Item -Path $RegistryPath -Force | Out-Null
    }

    # Set registry values and warning message
    New-ItemProperty -Path $RegistryPath -Name "Friendly Name" -Value $($Disk.FriendlyName) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Health Status" -Value $DriveHealthState -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Media Type" -Value $DriveMediaType -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Wear" -Value $([int]($DiskHealth.Wear)) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Read Errors" -Value $([int]($DiskHealth.ReadErrorsTotal)) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Temperature Delta" -Value $DiskTempDelta -PropertyType "String" -Force | Out-Null
    # The error counters live on the reliability counter object ($DiskHealth), not on the physical disk object
    New-ItemProperty -Path $RegistryPath -Name "Read Errors Uncorrected" -Value $([int]($DiskHealth.ReadErrorsUncorrected)) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Read Errors Total" -Value $([int]($DiskHealth.ReadErrorsTotal)) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Write Errors Uncorrected" -Value $([int]($DiskHealth.WriteErrorsUncorrected)) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Write Errors Total" -Value $([int]($DiskHealth.WriteErrorsTotal)) -PropertyType "String" -Force | Out-Null
    New-ItemProperty -Path $RegistryPath -Name "Output" -Value $OutputMsg -PropertyType "String" -Force | Out-Null
}
#endregion ScriptFunctions

#region ScriptRunningCode
# Create root registry key if not present
if (-not (Test-Path -Path $RegistryBase)) {
    New-Item -Path $RegistryBase -Force | Out-Null
}

# Loop through each disk
foreach ($Disk in ($Disks | Sort-Object DeviceID)) {
    # Set initial output variable state
    $OutputMsg = $null

    # Obtain disk health information from the current disk
    # (piping $Disk avoids matching several disks that share a friendly name)
    $DiskHealth = $Disk | Get-StorageReliabilityCounter | Select-Object -Property Wear, ReadErrorsTotal, ReadErrorsUncorrected, WriteErrorsTotal, WriteErrorsUncorrected, Temperature, TemperatureMax

    # Obtain media type and health state
    $DriveMediaType = $Disk.MediaType
    $DriveHealthState = $Disk.HealthStatus
    $DiskTempDelta = [int]$($DiskHealth.Temperature) - [int]$($DiskHealth.TemperatureMax)

    # Obtain SMART failure information (reported for the machine, not per disk)
    $DriveSMARTStatus = (Get-WmiObject -Namespace root\wmi -Class MSStorageDriver_FailurePredictStatus -ErrorAction SilentlyContinue | Select-Object InstanceName, PredictFailure, Reason) | Where-Object { $_.PredictFailure -eq $true }

    # Create custom PSObject
    $DiskHealthState = New-Object -TypeName PSObject

    # Create disk entry
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk Number" -Value $Disk.DeviceID
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "FriendlyName" -Value $($Disk.FriendlyName)
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "HealthStatus" -Value $DriveHealthState
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "MediaType" -Value $DriveMediaType
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk Wear" -Value $([int]($DiskHealth.Wear))
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk $($Disk.DeviceID) Read Errors" -Value $([int]($DiskHealth.ReadErrorsTotal))
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk $($Disk.DeviceID) Temperature Delta" -Value $DiskTempDelta
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk $($Disk.DeviceID) ReadErrorsUncorrected" -Value $([int]($DiskHealth.ReadErrorsUncorrected))
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk $($Disk.DeviceID) ReadErrorsTotal" -Value $([int]($DiskHealth.ReadErrorsTotal))
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk $($Disk.DeviceID) WriteErrorsUncorrected" -Value $([int]($DiskHealth.WriteErrorsUncorrected))
    $DiskHealthState | Add-Member -MemberType NoteProperty -Name "Disk $($Disk.DeviceID) WriteErrorsTotal" -Value $([int]($DiskHealth.WriteErrorsTotal))

    # Check for health, SMART, wear, read/write error, or temperature issues
    if ($DriveHealthState -ne "Healthy") {
        $OutputMsg = "Disk $($Disk.DeviceID) / $($Disk.FriendlyName) is in a $([string]$DriveHealthState.ToLower()) state"
    }
    elseif ($null -ne $DriveSMARTStatus) {
        $OutputMsg = "SMART predicted failure detected with reason code $($DriveSMARTStatus.Reason)"
    }
    elseif ([int]($DiskHealth.Wear) -ge $MaxWearValue) {
        $OutputMsg = "Disk failure likely on disk $($Disk.DeviceID) with media type $DriveMediaType. Current wear value is reading as $([int]($DiskHealth.Wear)), at or above the set threshold of $MaxWearValue%."
    }
    elseif ([int]($DiskHealth.ReadErrorsTotal) -ge $MaxRWErrors) {
        $OutputMsg = "A high number of disk read errors $([int]($DiskHealth.ReadErrorsTotal)) on disk $($Disk.DeviceID) with media type $DriveMediaType"
    }
    elseif ([int]($DiskHealth.WriteErrorsTotal) -ge $MaxRWErrors) {
        $OutputMsg = "A high number of disk write errors $([int]($DiskHealth.WriteErrorsTotal)) on disk $($Disk.DeviceID) with media type $DriveMediaType"
    }
    elseif ($([int]($DiskHealth.Temperature)) -gt $([int]($DiskHealth.TemperatureMax)) -and ([int]($DiskHealth.TemperatureMax)) -gt 0) {
        $OutputMsg = "Disk $($Disk.DeviceID) is currently running $DiskTempDelta above the maximum temperature rating $($DiskHealth.TemperatureMax) for the drive."
    }
    else {
        $OutputMsg = "Disk $($Disk.DeviceID) is in a healthy state. No action required."
    }

    # Write entries to Registry
    Write-RegistryEntries
}

# Set remediation value based on disk issues
$DriveHealthIssue = [boolean](Get-ChildItem -Path $RegistryBase -Recurse | Get-ItemProperty | Where-Object { $_.Output -notmatch "No action required" })

if ($DriveHealthIssue -eq $true) {
    # Flag error value / mark for remediation
    Write-Output "$((Get-ChildItem -Path $RegistryBase -Recurse | Get-ItemProperty | Where-Object { $_.Output -notmatch "No action required" }).Output)"; exit 1
}
else {
    # No issues found
    Write-Output "Disks are in a healthy state. No action required"; exit 0
}
#endregion ScriptRunningCode
Running this in our environment, we can view the output by adding the pre-remediation output column, as per the example below:
Remediation Script
In the event that a disk does report back a “suspect” state, the next thing we can do (this is purely optional) is notify the end user. The idea, of course, is to be proactive about getting an issue resolved, so if the user is also informed and asked, for example, to contact the IT Service Desk, it should help drive compliance for hard disk swap-outs.
Here I am using elements from Ben Whitmore’s notification script, updated so the images are downloaded from URLs provided within the script, and with some of the hardcoding and external dependencies removed. The script also caters for the fact that you will be running this in system context, displaying the prompt to the signed-in user through a scheduled task:
<#
    .NOTES
    ===========================================================================
     Created on:    21/04/2021 11:00 AM
     Created by:    Maurice Daly / Ben Whitmore
     Organization:  CloudWay
     Filename:      Invoke-DiskHealthNotification.ps1
    ===========================================================================
    .DESCRIPTION
    Displays a toast notification to the signed-in user when the disk health detection script has recorded an issue in the registry.
#>
Param
(
    [Parameter(Mandatory = $False)]
    [String]$ToastGUID
)

#region ToastCustomisation
# Registry location written by the detection script (values must match the detection script)
$Organisation = "YOUR ORG DETAILS"
$RegistryBase = "HKLM:\SOFTWARE\$Organisation\Monitoring\Disk Health"

#Create Toast Variables
$ToastTitle = "Please note that your hard drive is currently operating outside of healthy parameters. Please contact the IT service desk to arrange a replacement."
$Signature = "Monitored by Proactive Remediations"
$ButtonTitle = "IT Service Desk"
$ButtonAction = "YOUR IT HELPDESK URL"
$SnoozeTitle = "Snooze"

#ToastDuration: Short = 7s, Long = 25s
$ToastDuration = "long"

#Images
$BadgeImageUri = "https://YOUR STORAGE URL/Notifications/badgeimage.jpg"
$HeroImageUri = "https://YOUR STORAGE URL/Notifications/harddisk.jpg"
$BadgeImage = Join-Path $ENV:Windir -ChildPath "Temp\badgeimage.jpg"
$HeroImage = Join-Path $ENV:Windir -ChildPath "Temp\harddisk.jpg"
#endregion ToastCustomisation

#region ToastRunningValues
#Set Unique GUID for the Toast
If (!($ToastGUID)) {
    $ToastGUID = ([guid]::NewGuid()).ToString().ToUpper()
}

#Current Directory
$ScriptPath = $MyInvocation.MyCommand.Path
$CurrentDir = Split-Path $ScriptPath

#Set Toast Path to the Windows Temp Directory
$ToastPath = (Join-Path $ENV:Windir "Temp\$($ToastGuid)")
$ToastPSFile = $MyInvocation.MyCommand.Name
#endregion ToastRunningValues

#region ScriptFunctions
# Toast function
function Display-ToastNotification {
    #Fetching images from URI
    $WebClient = New-Object System.Net.WebClient
    $WebClient.DownloadFile("$BadgeImageUri", "$BadgeImage")
    $WebClient.DownloadFile("$HeroImageUri", "$HeroImage")

    #Set COM App ID > To bring a URL on button press to focus use a browser for the appid e.g. MSEdge
    #$LauncherID = "Microsoft.SoftwareCenter.DesktopToasts"
    $LauncherID = "{1AC14E77-02E7-4E5D-B744-2EB1AE5198B7}\WindowsPowerShell\v1.0\powershell.exe"
    #$Launcherid = "MSEdge"

    #Don't create a Scheduled Task if the script is running in the context of the logged on user, only if SYSTEM fired the script i.e. Deployment from Intune/ConfigMgr
    If (([System.Security.Principal.WindowsIdentity]::GetCurrent()).Name -eq "NT AUTHORITY\SYSTEM") {

        #Prepare to stage Toast Notification Content in the Toast TEMP folder
        Try {
            #Create TEMP folder to stage Toast Notification Content
            New-Item $ToastPath -ItemType Directory -Force -ErrorAction Continue | Out-Null
            $ToastFiles = Get-ChildItem $CurrentDir -Recurse

            #Copy Toast Files to Toast TEMP folder
            ForEach ($ToastFile in $ToastFiles) {
                Copy-Item -Path $ToastFile.FullName -Destination $ToastPath -ErrorAction Continue
            }
        }
        Catch {
            Write-Warning $_.Exception.Message
        }

        #Set new Toast script to run from TEMP path
        $New_ToastPath = Join-Path -Path $ToastPath -ChildPath $ToastPSFile

        #Create Scheduled Task to run as Logged on User
        $Task_TimeToRun = (Get-Date).AddSeconds(30).ToString('s')
        $Task_Expiry = (Get-Date).AddSeconds(120).ToString('s')
        $Task_Action = New-ScheduledTaskAction -Execute "C:\WINDOWS\system32\WindowsPowerShell\v1.0\PowerShell.exe" -Argument "-NoProfile -WindowStyle Hidden -File ""$New_ToastPath"" -ToastGUID ""$ToastGUID"""
        $Task_Trigger = New-ScheduledTaskTrigger -Once -At $Task_TimeToRun
        $Task_Trigger.EndBoundary = $Task_Expiry
        $Task_Principal = New-ScheduledTaskPrincipal -GroupId "S-1-5-32-545" -RunLevel Limited
        $Task_Settings = New-ScheduledTaskSettingsSet -Compatibility V1 -DeleteExpiredTaskAfter (New-TimeSpan -Seconds 600) -AllowStartIfOnBatteries
        # The original description referenced undefined $EventTitle / $EventText variables; the toast title is used instead
        $New_Task = New-ScheduledTask -Description "Toast_Notification_$($ToastGuid) Task for user notification. Title: $($ToastTitle) :: Source Path: $($ToastPath) " -Action $Task_Action -Principal $Task_Principal -Trigger $Task_Trigger -Settings $Task_Settings
        Register-ScheduledTask -TaskName "Toast_Notification_$($ToastGuid)" -InputObject $New_Task
    }

    #Run the toast if the script is running in the context of the Logged On User
    If (!(([System.Security.Principal.WindowsIdentity]::GetCurrent()).Name -eq "NT AUTHORITY\SYSTEM")) {

        $Log = (Join-Path $ENV:Windir "Temp\$($ToastGuid).log")
        Start-Transcript $Log

        #Get logged on user DisplayName
        #Try to get the DisplayName for Domain User
        $ErrorActionPreference = "Continue"

        Try {
            Write-Output "Trying Identity LogonUI Registry Key for Domain User info..."
            Get-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Authentication\LogonUI" -Name "LastLoggedOnDisplayName" -ErrorAction Stop | Out-Null
            # Do not pipe to Out-Null here, otherwise $User is always empty
            $User = Get-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Authentication\LogonUI" -Name "LastLoggedOnDisplayName" | Select-Object -ExpandProperty LastLoggedOnDisplayName -ErrorAction Stop

            If ($Null -eq $User) {
                $Firstname = $Null
            }
            else {
                $DisplayName = $User.Split(" ")
                $Firstname = $DisplayName[0]
            }
        }
        Catch [System.Management.Automation.PSArgumentException] {
            "Registry Key Property missing"
            Write-Warning "Registry Key for LastLoggedOnDisplayName could not be found."
            $Firstname = $Null
        }
        Catch [System.Management.Automation.ItemNotFoundException] {
            "Registry Key itself is missing"
            Write-Warning "Registry value for LastLoggedOnDisplayName could not be found."
            $Firstname = $Null
        }

        #Try to get the DisplayName for Azure AD User
        If ($Null -eq $Firstname) {
            Write-Output "Trying Identity Store Cache for Azure AD User info..."
            Try {
                $UserSID = (whoami /user /fo csv | ConvertFrom-Csv).Sid
                $LogonCacheSID = (Get-ChildItem HKLM:\SOFTWARE\Microsoft\IdentityStore\LogonCache -Recurse -Depth 2 | Where-Object { $_.Name -match $UserSID }).Name
                If ($LogonCacheSID) {
                    $LogonCacheSID = $LogonCacheSID.Replace("HKEY_LOCAL_MACHINE", "HKLM:")
                    $User = Get-ItemProperty -Path $LogonCacheSID | Select-Object -ExpandProperty DisplayName -ErrorAction Stop
                    $DisplayName = $User.Split(" ")
                    $Firstname = $DisplayName[0]
                }
                else {
                    Write-Warning "Could not get DisplayName property from Identity Store Cache for Azure AD User"
                    $Firstname = $Null
                }
            }
            Catch [System.Management.Automation.PSArgumentException] {
                Write-Warning "Could not get DisplayName property from Identity Store Cache for Azure AD User"
                Write-Output "Resorting to whoami info for Toast DisplayName..."
                $Firstname = $Null
            }
            Catch [System.Management.Automation.ItemNotFoundException] {
                Write-Warning "Could not get SID from Identity Store Cache for Azure AD User"
                Write-Output "Resorting to whoami info for Toast DisplayName..."
                $Firstname = $Null
            }
            Catch {
                Write-Warning "Could not get SID from Identity Store Cache for Azure AD User"
                Write-Output "Resorting to whoami info for Toast DisplayName..."
                $Firstname = $Null
            }
        }

        #Try to get the DisplayName from whoami
        If ($Null -eq $Firstname) {
            Try {
                Write-Output "Trying Identity whoami.exe for DisplayName info..."
                $User = whoami.exe
                $Firstname = (Get-Culture).TextInfo.ToTitleCase($User.Split("\")[1])
                Write-Output "DisplayName retrieved from whoami.exe"
            }
            Catch {
                Write-Warning "Could not get DisplayName from whoami.exe"
            }
        }

        #If DisplayName could not be obtained, leave it blank
        If ($Null -eq $Firstname) {
            Write-Output "DisplayName could not be obtained, it will be blank in the Toast"
        }

        $CustomHello = "Disk Health Issue Detected"

        #Load Assemblies
        [Windows.UI.Notifications.ToastNotificationManager, Windows.UI.Notifications, ContentType = WindowsRuntime] | Out-Null
        [Windows.Data.Xml.Dom.XmlDocument, Windows.Data.Xml.Dom.XmlDocument, ContentType = WindowsRuntime] | Out-Null

        #Build XML ToastTemplate
        [xml]$ToastTemplate = @"
<toast duration="$ToastDuration" scenario="reminder">
    <visual>
        <binding template="ToastGeneric">
            <text>$CustomHello</text>
            <text>$ToastTitle</text>
            <text placement="attribution">$Signature</text>
            <image placement="hero" src="$HeroImage"/>
        </binding>
    </visual>
    <audio src="ms-winsoundevent:notification.default"/>
</toast>
"@

        #Build XML ActionTemplate
        [xml]$ActionTemplate = @"
<toast>
    <actions>
        <action arguments="$ButtonAction" content="$ButtonTitle" activationType="protocol" />
        <action arguments="dismiss" content="Dismiss" activationType="system"/>
    </actions>
</toast>
"@

        #Define default actions to be added to $ToastTemplate
        $Action_Node = $ActionTemplate.toast.actions

        #Append actions to $ToastTemplate
        [void]$ToastTemplate.toast.AppendChild($ToastTemplate.ImportNode($Action_Node, $true))

        #Prepare XML
        $ToastXml = [Windows.Data.Xml.Dom.XmlDocument]::New()
        $ToastXml.LoadXml($ToastTemplate.OuterXml)

        #Prepare and Create Toast
        $ToastMessage = [Windows.UI.Notifications.ToastNotification]::New($ToastXML)
        [Windows.UI.Notifications.ToastNotificationManager]::CreateToastNotifier($LauncherID).Show($ToastMessage)

        Stop-Transcript
    }
}
#endregion ScriptFunctions

#region ScriptRunningCode
# Display notification for drive failure if present in the registry
$DriveHealthIssue = [boolean](Get-ChildItem -Path $RegistryBase -Recurse | Get-ItemProperty | Where-Object { $_.Output -notmatch "No action required" })

if ($DriveHealthIssue -eq $true) {
    Display-ToastNotification
}
#endregion ScriptRunningCode
Reading the registry values written by the detection script, the remediation script will display a notice to the end user should the output state there is an issue:
The IT Service Desk button is also optional, but it can be configured to open your ticketing system, for example.
Monitoring
Going back to Endpoint Analytics, we can see the output for failures by adding the Pre Remediation Output column and then examining those machines where the detection status is “failed”:
Configuration Manager
Not forgetting those running Configuration Manager clients in standalone environments, the same scripts will of course also work well as a one-time script, or within a configuration baseline.
Script run:
Conclusion
Although it might not be strictly possible to get your OEM to issue a replacement, through Proactive Remediations / Endpoint Analytics, we can at least try to predict hard disk failures before they occur.
Thanks for reading