Hardware Health Checks

If you buy new hardware or install a new machine, you should always check its health and prepare for failures.

RAM

New RAM should be fully tested for failures once. In case you need to test RAM ad hoc, add Memtest to your boot manager.

Memtest86 – UEFI

Attention: Memtest86 is proprietary software! Recommendation: Use the free and open source Memtest86+ on a normal Arch Linux USB stick in CSM mode (non-UEFI boot) instead of installing Memtest86-efi.

yay memtest86-efi 
memtest86-efi --install

If you use systemd-boot you can add a boot menu entry by selecting option 4.

Memtest86+ – GRUB

Memtest86+ (GPL licensed) releases before 6.0 don’t support DDR4. Recommendation: Use a normal Arch Linux USB stick in CSM mode (non-UEFI boot) instead of installing it.

pacman -S memtest86+
grub-mkconfig -o /boot/grub/grub.cfg

ECC RAM

ECC RAM is used in servers and workstations to detect bit errors. Where it is available, as on most Ryzen mainboards, it is recommended for desktops as well (at least Linus Torvalds recommends it: https://plus.google.com/+LinusTorvalds/posts/VdLMbfmgmGJ). It can usually detect and correct single-bit errors; multi-bit errors can only be detected but NOT corrected.

sudo pacman -S edac-utils

Now check whether ECC is enabled by the memory controller and whether it reports any issues.

$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
edac-util: No errors to report.

In this case ECC is working. Next, check the kernel log:

dmesg | grep -i edac
[ 0.586752] EDAC MC: Ver: 3.0.0
[ 4.108749] EDAC amd64: Node 0: DRAM ECC enabled.
[ 4.108751] EDAC amd64: F17h detected (node 0).
[ 4.108802] EDAC MC: UMC0 chip selects:
[ 4.108803] EDAC amd64: MC: 0: 8192MB 1: 0MB
[ 4.108804] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 4.108805] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 4.108805] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 4.108808] EDAC MC: UMC1 chip selects:
[ 4.108808] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 4.108809] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 4.108809] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 4.108810] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 4.108810] EDAC amd64: using x8 syndromes.
[ 4.108811] EDAC amd64: MCT channel count: 1
[ 4.108915] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
[ 4.108929] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 4.108929] AMD64 EDAC driver v3.5.0

The log states that ECC is enabled ("DRAM ECC enabled") and that the extra bit for ECC is available via x8 syndromes.
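The counters that edac-util reports can also be read directly from sysfs, which is handy for a cron job or a monitoring script. The following is a minimal sketch, assuming the EDAC driver exposes its memory controllers under /sys/devices/system/edac/mc with ce_count and ue_count files; the function name edac_counts is made up for this example.

```shell
#!/bin/sh
# Print corrected (ce) and uncorrected (ue) ECC error counts per memory controller.
# The optional first argument is the EDAC sysfs directory (assumed default below).
edac_counts() {
    edac_dir="${1:-/sys/devices/system/edac/mc}"
    found=0
    for mc in "$edac_dir"/mc*; do
        # Skip the unexpanded glob when no controller is registered.
        [ -r "$mc/ce_count" ] || continue
        found=1
        printf '%s: corrected=%s uncorrected=%s\n' \
            "$(basename "$mc")" "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
    if [ "$found" -eq 0 ]; then
        echo "no EDAC memory controller found (ECC inactive or driver not loaded)"
        return 1
    fi
    return 0
}
```

Calling edac_counts without arguments reads the live counters. A non-zero return value means that no memory controller was found, which usually points to disabled ECC or a missing EDAC module.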

Ryzen supports ECC; for more details read http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive.html. However, the system will not halt in case of an ECC error, so there is definitely still an open task for ECC on Ryzen, somewhere in Linux, the microcode or another component.

SSD/HDD

Smartctl tests

HDD/SSD health should be checked regularly, especially on file servers.

Install Smartmontools

pacman -S smartmontools

Graphical interface:

pacman -S gsmartcontrol

or, from the AUR, the KDE DisKMonitor:

yay diskmonitor

Smartd

Enable automatic SMART checks and mail notifications (very important on physical servers). For details, read https://wiki.archlinux.org/index.php/S.M.A.R.T..

sudo systemctl enable smartd.service

If needed, you can modify the settings in:

$ nano /etc/smartd.conf

To get the complete overview about a drive, run:

$ smartctl -a /dev/sda

To get just a short health status, run:

$ smartctl -H /dev/sda
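Besides the printed status, smartctl also reports problems through its exit status, which is a bit mask. The sketch below decodes that mask in a script; the bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page, and the function name decode_smartctl_status is only chosen for this example.

```shell
#!/bin/sh
# Decode the bit mask that smartctl returns as its exit status.
# Messages paraphrase the EXIT STATUS section of the smartctl man page.
decode_smartctl_status() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for msg in \
        "command line did not parse" \
        "device could not be opened" \
        "a SMART command failed or returned a checksum error" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Print the message for every bit that is set in the status.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
    return 0
}
```

A possible use is to run smartctl -H /dev/sda as root and then call decode_smartctl_status "$?" on the next line.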

To run the short, conveyance or long self-tests (not every test is provided by every disk, and sadly it often doesn’t work for USB drives), execute:

$ smartctl -t long /dev/sda

To view test results:

smartctl -l selftest /dev/sda
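To spot failed tests in a longer log, the output can be filtered. This is a rough sketch, assuming the usual ATA log layout in which every entry starts with "#" and a number and a clean run is reported as "Completed without error"; other device types such as NVMe may print a different format. The function name selftest_problems is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -l selftest` output on stdin and print every log entry
# whose status is not "Completed without error".
# Note: aborted or still running tests are listed as well.
selftest_problems() {
    grep '^# *[0-9]' | grep -v 'Completed without error'
    # grep exits non-zero when nothing matches; an empty result is fine here.
    return 0
}
```

For example, smartctl -l selftest /dev/sda | selftest_problems prints nothing when all logged tests finished cleanly.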

Set up automatic mails about drive health (OPEN)

# -m: send a mail to this address; -M test: send a test mail on each start of the service, so that you know which drives are monitored

DEVICESCAN -m address@domain.com -M test

Do not spin up disks that are in standby:

DEVICESCAN -n standby,15,q
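As far as I understand the smartd.conf man page, smartd ignores all remaining lines after a DEVICESCAN entry, so the mail and standby options probably have to go on one line. The small helper below prints such a combined line; it is only a sketch, the function name smartd_devicescan_line is made up, and the mail address is a placeholder argument.

```shell
#!/bin/sh
# Print a single DEVICESCAN directive that merges the mail options
# (-m, -M test) and the standby option (-n standby,15,q) shown above.
# $1 is the mail address to notify (placeholder, supply your own).
smartd_devicescan_line() {
    mail_to="$1"
    printf 'DEVICESCAN -m %s -M test -n standby,15,q\n' "$mail_to"
}
```

Running smartd_devicescan_line address@domain.com prints a line that can be pasted into /etc/smartd.conf in place of the two separate entries.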
