A strange smartd error

This morning my home server sent me an error.

Aug  2 06:25:37 www smartd[1061]: Device: /dev/nvme2, number of Error Log entries increased from 0 to 18446744073709551615

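Unnaturally large numbers are usually the maximum value of some integer type, and this one is the maximum of unsigned long long (2^64 - 1). That said, with the heat we've been having it wouldn't be surprising if the SSD had actually failed, so for now I ran smartctl.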
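As a quick aside, the claim about the number can be checked at compile time. This is only a minimal sketch; the names kLoggedCount and is_uint64_max are made up for illustration and do not come from smartmontools.

```cpp
#include <cstdint>
#include <limits>

// The "new" error count that smartd logged (hypothetical constant name).
constexpr std::uint64_t kLoggedCount = 18446744073709551615ULL;

// True when a counter equals the largest value uint64_t can hold (2^64 - 1).
// Hypothetical helper, for illustration only.
constexpr bool is_uint64_max(std::uint64_t v) {
  return v == std::numeric_limits<std::uint64_t>::max();
}

// Both checks are evaluated by the compiler; the file fails to build if either is false.
static_assert(is_uint64_max(kLoggedCount), "logged value is 2^64 - 1");
static_assert(kLoggedCount == ~static_cast<std::uint64_t>(0), "all 64 bits are set");
```

If this compiles, the logged value really is "all 64 bits set", which looks much more like a sentinel or overflow value than a real error count.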

# smartctl -a /dev/nvme2n1
smartctl 7.1 2020-04-05 r5049 [x86_64-linux-4.18.0-477.15.1.el8_8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       XPG GAMMIX S70 BLADE
Serial Number:                      2L252L16E9WL
Firmware Version:                   3.2.F.P7
PCI Vendor ID:                      0x1dbe
PCI Vendor Subsystem ID:            0x5236
IEEE OUI Identifier:                0x00abcd
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            494e4e 4f47524954
Local Time is:                      Wed Aug  2 08:22:44 2023 JST
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     110 Celsius
Critical Comp. Temp. Threshold:     120 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     2.80W       -        -    2  2  2  2       50     200
 3 -   0.1000W       -        -    3  3  3  3      500    5000
 4 -   0.0080W       -        -    4  4  4  4     2000   60000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        49 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    5%
Data Units Read:                    12,262,245 [6.27 TB]
Data Units Written:                 34,459,157 [17.6 TB]
Host Read Commands:                 58,055,554
Host Write Commands:                376,086,791
Controller Busy Time:               0
Power Cycles:                       45
Power On Hours:                     14,831
Unsafe Shutdowns:                   8
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               49 Celsius
Temperature Sensor 2:               30 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

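Nothing looks wrong: the overall health check PASSED, Error Information Log Entries is 0, and no errors are logged.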

So what is making smartd print this error? Time for a quick look at the source.

# yumdownloader --source smartmontools
# rpm -ivh smartmontools-*.src.rpm
# cd rpmbuild/SOURCES/
# tar zxvf smartmontools-7.1.tar.gz
# cd smartmontools-7.1
# grep -r 'number of Error Log entries increased' *
smartd.cpp:      PrintOut(LOG_CRIT, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64 "\n",
smartd.cpp:      MailWarning(cfg, state, 4, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64,

The message comes from smartd.cpp.

3761   // Check if number of errors has increased
3762   if (cfg.errorlog || cfg.xerrorlog) {
3763     uint64_t oldcnt = state.nvme_err_log_entries;
3764     uint64_t newcnt = le128_to_uint64(smart_log.num_err_log_entries);
3765     if (newcnt > oldcnt) {
3766       PrintOut(LOG_CRIT, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64 "\n",
3767                name, oldcnt, newcnt);
3768       MailWarning(cfg, state, 4, "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64,
3769                   name, oldcnt, newcnt);
3770       state.must_write = true;
3771     }
3772     state.nvme_err_log_entries = newcnt;
3773   }
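To see how this check produces exactly the message I got, here is a minimal sketch of the same comparison. DemoState and check_error_count are hypothetical stand-ins I made up; the real code uses smartd's own state structures and PrintOut()/MailWarning().

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the one field of smartd's per-device state used here.
struct DemoState {
  uint64_t nvme_err_log_entries = 0;
};

// Mirrors the comparison in the listing above. Returns the message that would
// be logged, or an empty string if the count did not increase.
std::string check_error_count(DemoState & state, const char * name, uint64_t newcnt)
{
  std::string msg;
  uint64_t oldcnt = state.nvme_err_log_entries;
  if (newcnt > oldcnt) {
    char buf[160];
    std::snprintf(buf, sizeof(buf),
                  "Device: %s, number of Error Log entries increased from %" PRIu64 " to %" PRIu64,
                  name, oldcnt, newcnt);
    msg = buf;
  }
  // As in smartd.cpp, the stored count is overwritten unconditionally.
  state.nvme_err_log_entries = newcnt;
  return msg;
}
```

Calling check_error_count(state, "/dev/nvme2", UINT64_MAX) on a fresh DemoState returns the same text as my log line. A following call with 0 returns an empty string and quietly resets the stored count, so a single bad read would show up as exactly one alert.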

le128_to_uint64() on line 3764 simply returns the maximum uint64_t value if any of the upper 8 bytes (bits 64 to 127) of the 128-bit value is non-zero.

// Convert 128 bit LE integer to uint64_t or its max value on overflow.
static uint64_t le128_to_uint64(const unsigned char (& val)[16])
{
  for (int i = 8; i < 16; i++) {
    if (val[i])
      return ~(uint64_t)0;
  }
  uint64_t lo = val[7];
  for (int i = 7-1; i >= 0; i--) {
    lo <<= 8; lo += val[i];
  }
  return lo;
}
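To convince myself of the saturating behaviour, here is a small sketch that feeds the function a few hand-built values. le128_to_uint64() is the same logic as in the listing above; Le128 and make_le128 are hypothetical helpers of mine, not part of smartmontools.

```cpp
#include <cstdint>

// Same logic as le128_to_uint64() in the smartd.cpp listing above.
static uint64_t le128_to_uint64(const unsigned char (& val)[16])
{
  for (int i = 8; i < 16; i++) {
    if (val[i])
      return ~(uint64_t)0;   // any non-zero byte in the upper half saturates
  }
  uint64_t lo = val[7];
  for (int i = 7-1; i >= 0; i--) {
    lo <<= 8; lo += val[i];
  }
  return lo;
}

// Hypothetical wrapper so a 16-byte array can be returned by value.
struct Le128 {
  unsigned char bytes[16];
};

// Hypothetical helper: build a 128-bit little-endian value from two 64-bit halves.
static Le128 make_le128(uint64_t lo, uint64_t hi)
{
  Le128 v{};
  for (int i = 0; i < 8; i++) {
    v.bytes[i]     = static_cast<unsigned char>(lo >> (8 * i));
    v.bytes[8 + i] = static_cast<unsigned char>(hi >> (8 * i));
  }
  return v;
}
```

With these, make_le128(18, 0) converts to 18, while make_le128(0, 1), which has a single bit set in the upper half, converts to 18446744073709551615. A genuine count of 2^64 - 1 in the lower half gives the same result, so the caller cannot tell overflow from a real maximum, though the latter seems implausible for an error counter.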

The argument passed to le128_to_uint64(), smart_log.num_err_log_entries, is a member of the structure that is read on line 3716:

3714   // Read SMART/Health log
3715   nvme_smart_log smart_log;
3716   if (!nvme_read_smart_log(nvmedev, smart_log)) {
3717       PrintOut(LOG_INFO, "Device: %s, failed to read NVMe SMART/Health Information\n", name);
3718       MailWarning(cfg, state, 6, "Device: %s, failed to read NVMe SMART/Health Information", name);
3719       state.must_write = true;
3720       return 0;
3721   }

The actual read is done by nvme_read_smart_log() in smartmontools-7.1/nvmecmds.cpp.

// Read NVMe SMART/Health Information log.
bool nvme_read_smart_log(nvme_device * device, nvme_smart_log & smart_log)
{
  if (!nvme_read_log_page(device, 0x02, &smart_log, sizeof(smart_log), true))
    return false;

  if (isbigendian()) {
    swapx(&smart_log.warning_temp_time);
    swapx(&smart_log.critical_comp_time);
    for (int i = 0; i < 8; i++)
      swapx(&smart_log.temp_sensor[i]);
  }

  return true;
}
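Note that the 128-bit counters are not byte-swapped here; they stay as raw 16-byte arrays until le128_to_uint64() converts them. The sketch below shows where the error counter would sit in a raw log page. It assumes the 512-byte SMART/Health page layout as I understand it from the NVMe specification, with "Number of Error Information Log Entries" at byte offset 176. That offset and all the names are my own assumptions, not taken from the smartmontools source.

```cpp
#include <cstddef>
#include <cstring>

// Assumed layout constants (from my reading of the NVMe spec, not from smartmontools).
constexpr std::size_t kSmartLogSize        = 512;  // size of log page 0x02
constexpr std::size_t kErrLogEntriesOffset = 176;  // assumed offset of the error counter
constexpr std::size_t kCounterSize         = 16;   // 128-bit little-endian counter

// Hypothetical helper: copy the raw 16-byte error counter out of a log page buffer.
void copy_err_log_entries(const unsigned char (& page)[kSmartLogSize],
                          unsigned char (& out)[kCounterSize])
{
  std::memcpy(out, page + kErrLogEntriesOffset, kCounterSize);
}

// Hypothetical helper: true if any of the upper 8 bytes is non-zero, which is
// the condition that makes le128_to_uint64() return the uint64_t maximum.
bool upper_half_nonzero(const unsigned char (& counter)[kCounterSize])
{
  for (std::size_t i = 8; i < kCounterSize; i++) {
    if (counter[i])
      return true;
  }
  return false;
}
```

If that layout is right, any stray non-zero byte at offsets 184 to 191 of the returned page would be enough to trigger the alert, whatever the lower 8 bytes contain.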

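Hmm, maybe a bit got flipped somewhere in that read... smartctl reported 0 entries two hours later, so whatever it was looks transient.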