Key Task: Troubleshoot System or Hardware (Device Health)

The Device Health dashboard gives a high-level view of all the errors and crashes (health events) occurring on devices across the enterprise. Use it to check whether similar types of crashes occur in different places, and view the details of any crash to troubleshoot issues on Windows and mobile devices across your organization. A health event indicates that a device or application encountered a significant error that affects the overall health of the user experience, like a corrupted disk or a system crash. There are three types of health events: application health events, hardware health events, and system health events.

For example, if several users in your organization complain of application crashes or system failures, use this dashboard to isolate the component causing the problem and investigate its root cause. This dashboard is ideal for finding common traits between the devices affected by each of these problems. In contrast, to troubleshoot a single device, use the Troubleshoot Device dashboard.

Examine health events in detail with the Device Health dashboard

You can easily view the number of users impacted by the same health event, and view more details, like the element which caused the problem (for example the name of the application process which crashed). You can also view the devices which suffered the same health event, and then drill down for more information about the device.

Tip

You can also view and analyze this data using the DEVICE_HEALTH_RAW REST API.
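
As a rough sketch of how a script might address this API, the following Python builds a request URL for the DEVICE_HEALTH_RAW view. Only the view name comes from this page. The base URL and the $top row limit are assumptions for illustration ($top is a standard OData query option), so check your Aternity REST API reference for the actual address and supported options.

```python
from urllib.parse import urlencode


def build_device_health_url(base_url: str, top: int = 100) -> str:
    """Return a URL for the DEVICE_HEALTH_RAW view with a row limit.

    Assumption: base_url is a placeholder for your API root, and the
    service accepts an OData-style $top option.
    """
    # Keep the "$" literal so the option reads "$top=..." in the URL.
    query = urlencode({"$top": top}, safe="$")
    return f"{base_url.rstrip('/')}/DEVICE_HEALTH_RAW?{query}"


# Hypothetical host name, for illustration only.
print(build_device_health_url("https://aternity.example.com/api", top=50))
```

The function only builds the address. Authentication and the request itself depend on your deployment and are left out.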

Procedure

  1. Open a browser and sign in to Aternity.
  2. Select Main Menu > Troubleshoot > System or Hardware.
    Accessing system troubleshooting
  3. Select the type of issue to troubleshoot.
    Selecting the type of system problem to troubleshoot
    Field Description
    System issue

    Select to examine an operating system, networking, low memory, low disk space, or printing issue.

    Device hardware issue

    Select to examine a hardware issue, like hard disk failures, or laptop battery issues.

    Boot issue

    Select if devices take longer than expected to boot. This opens the Boot Analysis dashboard, which shows the services that take the longest to load at system startup.

  4. Select whether to display numbers of users or numbers of events in the View menu at the top of the screen.
    Select to view the number of reported events or the number of unique users affected by these events
    Field Description
    Volume

    Displays the total number of reported health events for each type of event.

    Users

    Displays the number of unique users reporting each type of health event.
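
The difference between the two views can be sketched in a few lines of Python. This is an illustration only, assuming each health event is available as an (event type, username) pair. The sample data and function name are hypothetical and do not reflect how Aternity computes these figures internally.

```python
from collections import defaultdict


def summarize(events):
    """Return {event_type: (volume, unique_users)} for (event_type, user) pairs."""
    volume = defaultdict(int)
    users = defaultdict(set)
    for event_type, user in events:
        volume[event_type] += 1          # Volume: every reported event counts
        users[event_type].add(user)      # Users: each user counts once per type
    return {t: (volume[t], len(users[t])) for t in volume}


# Hypothetical sample: alice reports the same crash twice.
sample = [
    ("System Crash", "alice"),
    ("System Crash", "alice"),
    ("System Crash", "bob"),
    ("Low Disk Space", "carol"),
]
print(summarize(sample))
```

Here System Crash has a volume of 3 but only 2 unique users, which is why the two views can rank the same events differently.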

  5. In the Trends section, view the number of health events in your organization during the timeframe, to see whether there has been a sudden change in the number of health events recently reported to Aternity.
    View the recent history of the number of health events or the number of users with health events

    The number above each bar represents the number of health events reported at that time, or the number of unique users with health events, as selected in the View menu. Hover over the vertical bars to view both.
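
If you export the values shown above the bars, a simple check for a sudden change might look like the sketch below. The rule used here (latest value at least twice the average of the earlier values) is an assumption chosen for illustration, not a rule that Aternity applies.

```python
from statistics import mean


def sudden_change(counts, factor=2.0):
    """True if the latest count is at least `factor` times the mean of earlier counts."""
    if len(counts) < 2:
        return False                     # nothing to compare against
    baseline = mean(counts[:-1])
    if baseline == 0:
        return counts[-1] > 0            # any event after a quiet period is a change
    return counts[-1] >= factor * baseline


print(sudden_change([10, 12, 11, 30]))   # latest bar is well above the baseline
print(sudden_change([10, 12, 11, 12]))   # latest bar is in line with earlier bars
```

A larger factor makes the check less sensitive to normal day-to-day variation.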

  6. View the health events in the Health Events section, according to the Category you select in the drop-down list in the top bar.

    Select Apply to display the categories you chose.

    Select the type of health events to display in this dashboard

    Field

    Description

    (All)

    Select to view all the health events for all categories in the Health Events section.

    Application

    Displays application health events on applications which have a user interface.

    Background Process

    Displays application health events for programs which run in the background, without a user interface, like Windows services.

    Hardware

    Displays hardware health events like memory paging and hardware failures, sorted by those which occur most often.

    System

    Displays system health events, like system crashes (BSODs), sorted by the ones which occurred most often.

    a. Select Category > Application or Category > Background Process in the top bar to view the application or background process health events, sorted by the highest number of events.
      Field Description Source
      Crash (on Windows, Mac) or App Crashes (on mobile)

      (Windows) Aternity registers a Windows app crash when the Event Log issues event ID 1000 (a process or DLL ends unexpectedly), event ID 1001 (.NET process ends unexpectedly), event ID 1002 (a user stops a Not Responding process), or event ID 1026 (.NET runtime error).

      To resolve, note any error numbers, or check the logs of the application, then consult the support site of the application vendor.

      (Mac) Aternity reports a native Mac app crash only if the crash is registered in the macOS system log.

      (For monitored mobile apps only) Aternity reports a crashing monitored mobile app if it experiences an unhandled exception, or if the operating system (iOS or Android) tells it to abruptly stop (abort signal). For every mobile app crash, Aternity collects the exception code and type of exception, the app's stack trace, and a summary of the crash information. It also collects any breadcrumbs leading up to the crash. You can download the memory dump file if needed.

      (Windows) Agent queries Windows Event Log

      (Mac) The SteelCentral Agent for Mac queries the macOS system log.

      (Mobile) The Aternity Mobile SDK receives a notification that the monitored app crashed.

      Crash (After Hang)

      (Windows) Event ID 1002 occurs when a user manually forced an application's process to close after it stopped responding.

      (Mac) Aternity uses the system log to determine when a user has manually forced an application's process to close after it stopped responding.

      To resolve, note any common actions leading to the hang, then consult the app vendor's support site.

      (Windows) Agent queries Windows Event Log

      (Mac) The SteelCentral Agent for Mac queries the macOS system log.

      Crash (DotNet)

      (Windows only) Windows event ID 1001 occurs when a .NET process or DLL ended unexpectedly.

      To resolve, note any error numbers, or check the logs of the application, then consult the support site of the application vendor.

      Agent queries Windows Event Log

      DotNet Runtime Error

      (Windows only) Windows event ID 1026 appears when a handled exception in .NET occurs.

      No resolution is required. You can check the logs of the application to see which exception occurred.

      Agent queries Windows Event Log
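
The Windows rows of this table can be read as a lookup from event ID to health event name. The sketch below encodes that reading. The dictionary and function names are hypothetical, and the one-to-one mapping is an interpretation of the table rather than a description of the Agent's internal logic.

```python
# Windows Event Log IDs and the application health event each row describes.
APP_EVENT_NAMES = {
    1000: "Crash",                  # a process or DLL ends unexpectedly
    1001: "Crash (DotNet)",         # a .NET process or DLL ends unexpectedly
    1002: "Crash (After Hang)",     # user closes a Not Responding process
    1026: "DotNet Runtime Error",   # .NET runtime error
}


def classify_app_event(event_id):
    """Return the health event name for a Windows event ID, or None if unlisted."""
    return APP_EVENT_NAMES.get(event_id)


for event_id in (1000, 1002, 4624):
    print(event_id, classify_app_event(event_id))
```

Event IDs that are not in the table return None, so unrelated log entries are ignored.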

    b. Select Category > System in the top bar to view the system health events, sorted by the highest number of events.
      Field Description Source
      Low Disk Space

      Aternity creates this event if the device's system disk has less than 5% free space, or less than 500MB available, which limits the size of virtual memory.

      Low virtual memory significantly slows down performance, and causes applications to malfunction or crash with memory exception errors.

      To resolve, free some disk space (empty trash, remove unused apps) or increase its capacity.

      Agent queries Windows API once a minute.

      Low Memory Pagefaults

      Aternity creates this event if a device uses more than 95% of its physical memory AND issued more than 1000 virtual memory accesses (hard page faults) per second.

      High usage of virtual memory slows performance significantly, because reading from the hard disk is about 1000 times slower than reading from physical memory (RAM).

      To resolve, increase the capacity of RAM on the device.

      Agent queries Windows Performance Counters once a minute.

      Since this status can continue for some time, it reports only one such event per day for each device.

      Low Virtual Memory

      Aternity creates this event if the device uses more than 90% of its virtual memory (hard disk) for more than three minutes.

      Low virtual memory significantly slows down performance, and causes applications to malfunction or crash with memory exception errors.

      To resolve, free some disk space (empty trash, remove unused apps) or increase its capacity. In addition, increase the capacity of RAM on the device.

      Agent queries Windows Performance Counters once every three minutes.

      Since this status can continue for some time, it reports only one such event per day for each device.

      Memory Allocation Failure / Nonpaged

      Windows event ID 2019 is caused by a memory leak. It has the description "The server was unable to allocate from the system non-paged pool because the pool was empty." For details and a resolution, search this error ID in Microsoft's support site.

      Agent queries Windows Event Log

      Memory Allocation Failure / Paged

      Windows event ID 2020 has the description "The server was unable to allocate from the system paged pool because the pool was empty." For details and a resolution, search this error ID in Microsoft's support site.

      Agent queries Windows Event Log

      Network Interface Long Queue

      Aternity creates this event if a device's network connection has a queue of more than two data items waiting to be sent. Network jams often result in slower performance.

      If this remains consistently high, it could point to a hardware problem with the NIC (network interface card) or other networking component.

      To resolve, consider updating the component's driver, or find the cause of high loads on specific servers. Also consider checking the component's driver settings, verifying that the Windows Receive Side Scaling (RSS) option is enabled.

      Agent queries Windows Performance Counters once every five minutes.

      Since this status can continue for some time, it reports only one such event per day for each device.

      Network Interface Saturation

      Aternity creates this event if a device's network interface card (NIC) is using more than 75% of its bandwidth capacity.

      As a NIC reaches saturation, it starts to lose network packets, resulting in performance drops.

      If saturation persists over several days, add a faster network card or consider segmenting the network, or scheduling the high bandwidth activities to off-peak hours. For example, you can optimize traffic by scheduling backups or virus checks at night.

      Agent queries Windows API to check all the device's NICs once every five minutes.

      Since this status can continue for some time, it reports only one such event per day for each device.

      Overheat Related Shutdown

      Windows event ID 86 occurs when the device shuts down due to overheating (critical thermal event).

      It indicates a hardware problem, like a dusty CPU, broken fan or obstructed air vent.

      Turn off your computer, clean the heat sinks, and make sure that air circulates properly.

      Agent queries Windows Event Log

      Printing Error / Bad Security Descriptor

      Windows event ID 366 occurs when the print queue security settings are not configured correctly.

      Try restarting the print spooler. For details and a resolution, search this error ID in Microsoft's support site.

      Agent queries Windows Event Log

      Printing Error / Init Failed

      Windows event ID 354 indicates the printing operation failed to initialize, due to low system resources on the device. For details and a resolution, search this error ID in Microsoft's support site.

      Agent queries Windows Event Log

      Printing Error / No Driver Found

      Windows event ID 319 occurs when the printer could not initialize.

      This typically occurs when the operating system did not find a suitable driver.

      To resolve, install a compatible printer driver. For details and a resolution, search this error ID in Microsoft's support site.

      Agent queries Windows Event Log

      Printing Error / Package Regeneration Failed

      Windows event ID 73 occurs when the print spooler failed to regenerate the printer driver information. This can occur after a system upgrade or a disk corruption.

      If this issue persists, it indicates low system resources (CPU, disk I/O or memory resources).

      To resolve, investigate the Top Processes section in the Troubleshoot Device dashboard.

      Agent queries Windows Event Log

      Printing Error / Port Init Error

      Windows event ID 66 occurs when the printer failed to initialize its ports.

      The error message states, "This error usually occurs because of a problem with the port monitor. Try recreating the port using a standard TCP/IP printer port, if possible. This problem does not affect other printers."

      Agent queries Windows Event Log

      Printing Error / Print Failed

      Windows event ID 372 occurs when a document failed to print.

      Try printing again or restart the print spooler.

      Agent queries Windows Event Log

      Printing Error / Spooler Creation Failed

      Windows event ID 363 occurs when the print spooler failed to start.

      If this issue persists, it indicates low system resources (CPU, disk I/O or memory resources).

      To resolve, investigate the Top Processes section in the Troubleshoot Device dashboard.

      Agent queries Windows Event Log

      Printing Error / Spooler Out of Resources

      Windows event ID 373 occurs when a component of the spooler has too many open Graphical Device Interface (GDI) objects.

      As a result, some enhanced metafile (EMF) print jobs might not print.

      To resolve, restart the spooler.

      Agent queries Windows Event Log

      Printing Error / Spooler Shutdown

      Windows event ID 99 occurs when the print spooler encountered a fatal error while executing a critical operation and must immediately shut down.

      To resolve, restart the print spooler service from Windows Services, or open a command prompt and type net start spooler.

      Agent queries Windows Event Log

      System Crash

      (Windows) Aternity reports a system crash when Windows creates a memory dump file after a BSOD. Aternity analyzes the Windows dump and extracts the following data:

      • The likely name of the Windows process which caused the crash.

      • The module or driver which caused the issue, including the name, start address, and offset.

      • The event, which contains Microsoft's stop error codes ('bug check codes').

      (Mac) Aternity reports a system crash when it detects a kernel panic in the macOS system logs.

      To troubleshoot, view the details of the event and research further on the name of the process or module and its error codes.

      (Windows) Agent queries Windows API

      (Mac) The SteelCentral Agent for Mac queries the macOS system log.

      Unexpected Shutdown

      (Windows) Event ID 6008 indicates an unexpected shutdown.

      (Mac) Aternity reports an unexpected shutdown when one is recorded in the macOS system logs.

      This can be due to a hardware failure (like a power cut, or excessive heat) or a firmware or driver fault, or when a program forces the device to shut down while the computer is locked and password-protected.

      To troubleshoot, check the Event Viewer for critical errors which might correlate with the shutdown. For example, if you see a disk controller error (Event ID 11), you can run check disk (chkdsk), or check each disk with the S.M.A.R.T utility.

      (Windows) Agent queries Windows Event Log

      (Mac) The SteelCentral Agent for Mac queries the macOS system log.

      WiFi Disconnect

      (On Windows devices with Agent 9.2 or later) Aternity reports whenever a device unexpectedly stops receiving the signal from a WiFi network, and suddenly disconnects from the network.

      This reports only unexpected disconnects, not those caused by a user action like switching to airplane mode, attaching a laptop to a docking station (where it continues its connection via LAN) or putting it in sleep mode.

      (Windows) Agent queries Windows API.

      Windows Update Failure

      Windows event ID 20 occurs when the process for updating Windows failed.

      This can happen, for example, when an update is corrupt, a previous update is missing, an update is installed before a required reboot, the network connection is poor, or the user does not have the permissions required to install the update.

      To resolve, restart the device and then install the updates manually. You can also perform a system restore to revert to the state before the failed updates, or use the Windows Update troubleshooter to diagnose and fix the update problems.

      Agent queries Windows Event Log
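
The numeric thresholds in the table above can be restated as simple predicates. The sketch below is illustrative only: the function names are hypothetical, the inputs are assumed to be already measured, and 1 MB is taken as 1024 × 1024 bytes, which this page does not specify.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def low_disk_space(free_bytes, total_bytes):
    """System disk has less than 5% free space, or less than 500 MB available."""
    return free_bytes * 100 < 5 * total_bytes or free_bytes < 500 * MB


def low_memory_pagefaults(memory_used_pct, hard_faults_per_sec):
    """More than 95% of physical memory used AND more than 1000 hard page faults per second."""
    return memory_used_pct > 95 and hard_faults_per_sec > 1000


def low_virtual_memory(virtual_used_pct, minutes_above):
    """More than 90% of virtual memory used for more than three minutes."""
    return virtual_used_pct > 90 and minutes_above > 3


def nic_long_queue(queue_length):
    """More than two data items waiting to be sent."""
    return queue_length > 2


def nic_saturated(used_bps, capacity_bps):
    """NIC is using more than 75% of its bandwidth capacity."""
    return used_bps * 100 > 75 * capacity_bps


# A 100 GB system disk with 4 GB free is below the 5% threshold.
print(low_disk_space(4 * 1024 * MB, 100 * 1024 * MB))
```

Note that Low Disk Space triggers on either condition, while Low Memory Pagefaults requires both conditions at once.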

    c. Select Category > Hardware in the top bar to view the hardware health events, sorted by the highest number of events.
      Field Description Source
      Battery Wear

      (Windows laptops only) Aternity checks if the battery capacity drops below a threshold (default is 50%), compared with the vendor's factory settings. This indicates that a full battery charge drains much faster than it should.

      To resolve, replace the battery.

      Agent queries Windows API once a day to obtain the battery's Designed Capacity versus its Current Capacity Value.

      Corrupted FS

      This event occurs when the system disk contains damaged or corrupted files, which may cause Windows crashes and data loss.

      This could happen, for example, after a power cut, or after a hardware change.

      To resolve, run the System File Checker to restore corrupted files, or rescue the data by connecting the system disk as a secondary drive to another device. If the disk is physically damaged, use third-party data rescue services.

      Agent queries Windows Event Log once a minute.

      Since Windows can generate a flood of events for a problem like this, it reports only up to two events for each device per minute.

      Faulty HD S.M.A.R.T status

      Aternity checks if the device's self-monitoring hard disk (SMART disk) generated an error.

      S.M.A.R.T drives (Self-Monitoring, Analysis, and Reporting Technology) check their own reliability and give advanced warnings if they start to fail. These warnings could predict complete failure, or something less significant, like the inability to write to a sector, or slower performance.

      To resolve, back up your data as soon as possible and determine whether you should replace the drive.

      Agent queries Windows API once per minute.

      Since this status can continue for some time, it reports only one such event per day for each device.

      HD Bad Blocks

      Windows event ID 7 occurs when there is a corrupted block of data on the hard disk. If many bad sectors develop, the drive may fail and need attention.

      Replace a physically damaged disk immediately. For 'soft' or logical bad sectors, you can use Windows Disk Check.

      Agent queries Windows Event Log

      HD Failure

      Windows event ID 52 occurs with an imminent failure of the hard disk.

      Back up your data immediately, then use a scanning tool to detect problems. For example, if a disk is too hot, switch off the PC and disconnect the power of that hard disk until you replace it.

      Agent queries Windows Event Log
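
Two rules from the table above lend themselves to a short sketch: the Battery Wear comparison of current capacity against designed capacity, and the limit of one event per day for each device. The names below are hypothetical, and "per day" is assumed here to mean a calendar day, which this page does not state.

```python
from datetime import date


def battery_worn(designed_capacity, current_capacity, threshold_pct=50):
    """True if current capacity fell below threshold_pct of the designed capacity."""
    if designed_capacity <= 0:
        raise ValueError("designed_capacity must be positive")
    return current_capacity * 100 < threshold_pct * designed_capacity


class DailyThrottle:
    """Allow at most one report per (device, event) per calendar day."""

    def __init__(self):
        self._last_day = {}

    def allow(self, device, event, day):
        key = (device, event)
        if self._last_day.get(key) == day:
            return False                 # already reported today
        self._last_day[key] = day
        return True


# Hypothetical capacities in mWh: 24000 of 50000 is 48%, below the 50% default.
print(battery_worn(50000, 24000))

throttle = DailyThrottle()
print(throttle.allow("pc-1", "Faulty HD S.M.A.R.T status", date(2024, 1, 1)))
print(throttle.allow("pc-1", "Faulty HD S.M.A.R.T status", date(2024, 1, 1)))
```

The second call on the same day is rejected, which mirrors how a long-running condition produces a single event rather than one per query.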

  7. Correlate the health events with any of Aternity’s attributes, by selecting one from the drop-down menu in any of the four sections at the bottom of the dashboard.

    Find a view where one or two entries account for most or all of the selected health events. For example, select a health event and check whether it occurs only on a specific operating system.

    Select the data to display
    Field Description
    Components

    Displays the name, type and version of the software or hardware component which caused this health event, for example, a battery, a network interface, a disk drive, a printer, or an application (shown with its application name and process name, like Acrobat Reader (AcroRd32.exe), or Point of Sale (com.company.app2) for mobile apps). You can also hover your mouse over any of the horizontal bars in the section, and drill down to research the root cause of the crash on the web.

    Business Locations

    Lists the various locations in your organization impacted by a health event.

    Event Details

    Displays additional information about the component which caused this health event (for example, the memory type for a memory allocation failure event, or the DLL version of an application crash, and so on). You can also hover your mouse over any of the horizontal bars in the section, and drill down to research the root cause of the crash on the web.

    Departments

    Lists the departments in your organization impacted by the health event.

    Regions

    You can optionally define a region in Aternity to group together several locations under a single label, like the geographical region of EMEA, North America or even Southern Europe, South-Western US or any other grouping you choose.

    Countries

    Displays the country of the current location of the device.

    States

    Displays the geographical state of the current location of the devices (or area, if state is not applicable).

    Cities

    Displays the city of the current location of the device.

    Usernames

    Displays the username signed in to the device's operating system.

    You can investigate the possible causes of the health event further by drilling down to the Monitor User Experience dashboard.

    Device Names

    Displays the hostname of the monitored device. View it in the Windows Control Panel > System > Computer Name, or on Apple Macs in System Preferences > Sharing > Computer Name.

    (Mobile) Displays the Device Name field. You can customize the hostname of iOS or Android devices running your enterprise's app, so device names appear in the dashboards with a consistent naming policy. For example, you can dynamically assign the device name according to the enterprise username of the app.

    You can investigate the possible causes of the health event further by drilling down to the Monitor User Experience dashboard or to the Troubleshoot User or Device dashboard.

    Device Types

    Displays the type of device reporting performance to Aternity.

    • Desktops are monitored Windows devices without a fitted battery, or for Macs, any monitored MacBook running macOS or OS X.

    • Laptops are Windows devices with a battery and a built-in keyboard (including all Windows hybrid tablet/laptop models), or for Macs, any monitored laptop running macOS or OS X.

    • Remote Devices have applications accessed remotely via an RDP protocol, for example, with Microsoft's Remote Desktop Connection.

    • Smartphones run monitored mobile apps on a small touch screen within a mobile operating system environment.

    • Tablets have larger touch screens, and no built-in keyboard, running iOS or Android. If it runs Windows, it is defined as a tablet if it is a known model of a Windows pure tablet (like Microsoft Surface models).

    • Virtual App Servers offer multiple users access to a single instance of an application, for example, with Citrix XenApp.

    • Virtual Desktops offer the ability to run an application within a VDI environment, which is a virtual instance of the entire desktop operating system (usually Windows).

    OS Families

    Displays the broad category of the operating system. Use this to differentiate between different major operating system groups. For example, it displays all releases of Microsoft Windows as MS Windows, all releases of Windows Server as MS Windows Server or all releases of iOS as iOS.

    OS Names

    Displays the generic name and version of the operating system (like MS Windows 10, MS Windows Server 2008 R2, MacOS 10.3, iOS 10 or Android 6). For example, it displays Windows 10 Pro and Windows 10 Enterprise all as MS Windows 10.

    OS Versions

    (For all devices except mobile)

    Displays the full name, the exact version number, and the service pack version of the operating system. In Windows 10, it includes the release ID (like Microsoft Windows 10 Enterprise 1507). Use this to differentiate between details of the same operating system. For example, it lists MS Windows Server 2008 R2 Enterprise SP 1.0 separately from MS Windows Server 2008 R2 Enterprise SP 2.0.

    OS Architectures

    Displays whether the operating system of the monitored device is 32-bit or 64-bit.

    OS Disk Types

    (Windows only, Agent 9.0.3 or later) Displays the type of hard disk containing the operating system. Possible values are:

    • HDD for a traditional spinning hard disk drive

    • SSD for a solid state drive

    • Virtual if this is not a physical device.

    Manufacturers

    Displays the name of the vendor which created this device, like Samsung, Apple, Dell, Lenovo, and so on.

    Models

    Displays the name and the model number of the device, like iPhone 6s, GalaxyTab8, MacBook Pro 12.1, Dell Latitude D620.

    # CPU Cores

    (Desktops, laptops and mobile devices only) Displays the number of CPU cores of the device.

    Memory Size

    Displays the size of physical RAM of the device.
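
The idea of finding an attribute where one or two entries account for most of the events can be sketched as follows. This assumes the events were exported as dictionaries keyed by attribute name. The field names and sample values are hypothetical.

```python
from collections import Counter


def top_share(events, attribute, top_n=2):
    """Return (top values, share of all events they cover) for one attribute."""
    counts = Counter(event[attribute] for event in events)
    top = counts.most_common(top_n)
    total = sum(counts.values())
    share = sum(n for _, n in top) / total if total else 0.0
    return [value for value, _ in top], share


# Hypothetical export: three of four crashes are on the same OS.
events = [
    {"os_name": "MS Windows 10", "city": "Boston"},
    {"os_name": "MS Windows 10", "city": "London"},
    {"os_name": "MS Windows 10", "city": "Paris"},
    {"os_name": "MS Windows Server 2008 R2", "city": "Tokyo"},
]
print(top_share(events, "os_name", top_n=1))
print(top_share(events, "city", top_n=1))
```

In this sample the operating system covers 75% of the events while no single city covers more than 25%, so the operating system is the more promising attribute to investigate.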

  8. To view the details of a single health event, select the event in the Health Events section and view the information in the other sections of the dashboard.
    View the details of a single health event by selecting it

    The entire dashboard displays only the data concerning the selected event.

    To view further details of a single component from a single health event, like the version number of the process associated with an application crash, select an entry from the Components section to view its details, the devices and locations which suffer from that health event, and so on.

    View the details of a single component of a health event
  9. You can also drill down from a component to research the root cause of the crash on the web.
    Research a component's crash message in more detail

    Select Research crash root cause to search for errors with the component name, its associated DLL if relevant, and the phrase Application Crash.

    Look for support forums or technotes in the vendor’s knowledge base which describe similar crashes. Check for patches or suggested changes to the configuration to resolve the problem. Apply the patches or changes to the devices affected by the issue, listed in the Devices section of the dashboard.

    Once you have applied the changes, return to this dashboard, and select the same component name. View the Trends section to validate that the number of occurrences dropped after the changes were applied.

    If the problem persists, contact the application vendor and open a support ticket. Use this dashboard to provide them with the application version, DLL, and any additional details they request.
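
To run the same kind of search by hand, for example when collecting evidence for a support ticket, you could assemble the terms described above yourself. The sketch below is an assumption about how such a query might be composed. The exact query that the Research crash root cause option sends is not documented on this page, and the DLL name in the example is hypothetical.

```python
def crash_search_query(component, dll=None):
    """Combine the component name, an optional DLL, and the phrase "Application Crash"."""
    parts = [component]
    if dll:
        parts.append(dll)                    # only when a DLL is relevant
    parts.append('"Application Crash"')      # quoted so it is searched as a phrase
    return " ".join(parts)


print(crash_search_query("AcroRd32.exe", "ntdll.dll"))
print(crash_search_query("AcroRd32.exe"))
```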

  10. To search for an item in any of the sections with long lists, use the search box of that section.

    For example, to search for a particular hostname, type the name into the Device Names search box and press Enter.

    Search for a device, component or event detail
  11. To limit the scope of this dashboard, select a period in the Timeframe field at the top of the screen, or enter a location name in the Location search field to display only the information for that location.
    Change the scope of data displayed in this dashboard with the Timeframe or the Location menu
    Field Description
    Timeframe

    You can change the start time of the data displayed in this dashboard in the Timeframe menu in the top right corner of the dashboard.

    You can access data in this dashboard (retention) going back up to 30 days. This dashboard's data refreshes every hour.
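
When a script or report asks for a start time, the 30-day retention limit means that older start times cannot be honored. A minimal sketch of that constraint follows. The function name is hypothetical, and moving an out-of-range start time to the earliest available point is a design choice made for this example.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # this dashboard keeps up to 30 days of data


def clamp_start(requested_start, now):
    """Return the requested start time, or the oldest retained time if it is too early."""
    earliest = now - RETENTION
    return max(requested_start, earliest)


now = datetime(2024, 1, 31, 12, 0)
print(clamp_start(datetime(2023, 12, 1), now))   # too old, moved to 30 days before now
print(clamp_start(datetime(2024, 1, 20), now))   # within retention, unchanged
```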