If you buy new hardware or have installed a new machine, always check its health and prepare for failures.
New RAM should be fully tested for failures once. If you need to test RAM ad hoc, add Memtest to your boot manager.
Memtest86 – UEFI
Attention: Memtest86 is proprietary software! Recommendation: use the free and open source Memtest86+ from a normal Arch Linux USB stick in CSM mode (non-UEFI boot) instead of installing memtest86-efi.
yay memtest86-efi
memtest86-efi --install
If you use systemd-boot, you can add a boot menu entry by selecting option 4.
Memtest86+ – GRUB
Memtest86+ (GPL-licensed) doesn’t support DDR4. Recommendation: use a normal Arch Linux USB stick in CSM mode (non-UEFI boot) instead of installing it.
pacman -Sy memtest86+
grub-mkconfig -o /boot/grub/grub.cfg
ECC RAM is used in servers and workstations to detect bit errors. Where it is available, as on most Ryzen mainboards, it is recommended for desktops as well (at least Linus Torvalds recommends it: https://plus.google.com/+LinusTorvalds/posts/VdLMbfmgmGJ). ECC can usually detect and correct single-bit errors; multi-bit errors can only be detected, NOT corrected.
sudo pacman -Sy edac-utils
Now check whether ECC is activated by the memory controller and whether it reports any issues.
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
edac-util: No errors to report.
In this case ECC is working. Next check:
dmesg | grep -i edac
[ 0.586752] EDAC MC: Ver: 3.0.0
[ 4.108749] EDAC amd64: Node 0: DRAM ECC enabled.
[ 4.108751] EDAC amd64: F17h detected (node 0).
[ 4.108802] EDAC MC: UMC0 chip selects:
[ 4.108803] EDAC amd64: MC: 0: 8192MB 1: 0MB
[ 4.108804] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 4.108805] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 4.108805] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 4.108808] EDAC MC: UMC1 chip selects:
[ 4.108808] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 4.108809] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 4.108809] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 4.108810] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 4.108810] EDAC amd64: using x8 syndromes.
[ 4.108811] EDAC amd64: MCT channel count: 1
[ 4.108915] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
[ 4.108929] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 4.108929] AMD64 EDAC driver v3.5.0
It states ECC is enabled and that the extra bit for ECC is available via x8 syndromes.
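The manual check of the kernel log can be scripted. The following is a minimal sketch: ecc_enabled is a hypothetical helper name, and the match string is simply taken from the amd64_edac line in the log above, so other EDAC drivers may need a different pattern.

```shell
#!/bin/sh
# Sketch: succeed if the kernel log text on stdin mentions enabled DRAM ECC.
# ecc_enabled is a hypothetical helper name, not part of edac-utils.
ecc_enabled() {
    # Pattern copied from the amd64_edac message shown above (assumption:
    # other EDAC drivers may word this message differently).
    grep -qi 'DRAM ECC enabled'
}

# Usage (reading the kernel log may require root):
#   dmesg | ecc_enabled && echo "ECC enabled" || echo "ECC not reported"
```

Reading from stdin keeps the helper usable with journalctl -k output as well as dmesg.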
Ryzen supports ECC; for more details read http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive.html. However, the system will not halt on an ECC error, so there is definitely still an open task for ECC on Ryzen somewhere in Linux, the microcode, or another component.
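Since the system keeps running after an ECC error, it can help to poll the error counters yourself. The kernel exposes them per memory controller as ce_count and ue_count under sysfs. The sketch below sums them; edac_totals is a hypothetical helper name, and the default path /sys/devices/system/edac/mc is an assumption that holds only when an EDAC driver is loaded.

```shell
#!/bin/sh
# Sketch: sum corrected (ce_count) and uncorrected (ue_count) EDAC counters.
# edac_totals is a hypothetical helper; the optional first argument
# overrides the sysfs directory (assumed default below).
edac_totals() {
    base="${1:-/sys/devices/system/edac/mc}"
    ce=0
    ue=0
    for mc in "$base"/mc*; do
        # Skip the unexpanded pattern and controllers without counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "corrected=$ce uncorrected=$ue"
}

# Usage:
#   edac_totals
```

A cron job or systemd timer could call this and send a mail whenever either number is non-zero.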
HDD/SSD health should be checked regularly, especially on file servers.
pacman -Sy smartmontools
pacman -Sy gsmartcontrol
As an alternative to gsmartcontrol, the KDE tool disKmonitor is available from the AUR.
Enable automatic SMART checks and the mail service (very important on real hardware servers). For details read https://wiki.archlinux.org/index.php/S.M.A.R.T..
sudo systemctl enable smartd.service
If needed, you can modify the settings in:
$ nano /etc/smartd.conf
To get the complete overview about a drive, run:
$ smartctl -a /dev/sda
To get just a short health status, run:
$ smartctl -H /dev/sda
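For use in scripts, the short health report can be reduced to an exit status. This is a minimal sketch: smart_health_ok is a hypothetical helper name, and it relies on the wording smartctl prints for ATA drives ("test result: PASSED") and SCSI drives ("SMART Health Status: OK"). Any other output is treated as unhealthy.

```shell
#!/bin/sh
# Sketch: turn `smartctl -H` output on stdin into an exit status.
# smart_health_ok is a hypothetical helper name, not part of smartmontools.
smart_health_ok() {
    # ATA drives report "... test result: PASSED"; SCSI drives report
    # "SMART Health Status: OK". Anything else counts as not healthy.
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Usage, assuming /dev/sda as in the commands above:
#   smartctl -H /dev/sda | smart_health_ok || echo "check /dev/sda"
```

Matching text is the simplest approach; smartctl also encodes failures in its own exit status, which may be the more robust choice for production monitoring.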
To run the short, conveyance, or long self-tests (not every test is provided by every disk, and sadly they don’t work for USB drives), execute:
$ smartctl -t long /dev/sda
To view test results:
smartctl -l selftest /dev/sda
Set up automatic mails about drive health (OPEN)
# -m sends mail to this address; -M test sends a test mail each time the service starts, so that you know which drives are monitored
DEVICESCAN -m firstname.lastname@example.org -M test
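Do not spin up disks that are in standby (after 15 skipped checks the disk is checked anyway; q suppresses the log message about a skipped check):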
DEVICESCAN -n standby,15,q
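According to the smartd.conf manual, smartd ignores all lines after a DEVICESCAN entry, so the two DEVICESCAN lines above do not both take effect if they are placed in the same file. A combined entry might look like the following sketch. The address is the placeholder used above, and the explicit -a (monitor all SMART properties) is an assumption about the desired baseline.

```
# /etc/smartd.conf (sketch, adjust to your setup)
# -a                monitor all SMART properties (assumed baseline)
# -n standby,15,q   skip checks while the disk is in standby
# -m / -M test      mail address, plus a test mail at service start
DEVICESCAN -a -n standby,15,q -m firstname.lastname@example.org -M test
```

After editing the file, restart the service with sudo systemctl restart smartd.service so the new settings are read.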