dbus-sensors: nvmesensor: add 1s timeout for mi worker cv.wait
The main thread `workerCv.notify_all` function may be called
when mi worker thread is not in waiting state because there is a
small time window between the `io.run()` and the
`std::unique_lock<std::mutex> lock(mtx)` statement.
This corner case is not a big issue after a subsystem is fully
started because there is periodical health poll to notify the
mi worker thread. However, for the `Status::Initializing` state,
this corner case will cause mi worker thread never wake up to
execute handler.
There exist many solutions to fix this problem. The simplest (one line)
fix should be adding 1 second timeout for the `cv.wait`.
Tested:
Reproduce script: https://paste.googleplex.com/5229001098919936
Before the fix, if we keep restarting nvmesensor for hundreds of loops,
we will find that at least one subsystem keeps on printing:
subsystem is intiatilzing, cancel the health poll
processing health data has been cancelled
After the fix, the same restarting test passed hundreds of loops without
printing the message above.
Google-Bug-Id: 310663382
Change-Id: I6493e87d1ee555a8fba581470da503b0c67bec27
Signed-off-by: Jinliang Wang <jinliangw@google.com>
(cherry picked from commit 5e4bce1c1001cb6a73007acf864a80cc85519693)
diff --git a/recipes-phosphor/sensors/dbus-sensors/0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch b/recipes-phosphor/sensors/dbus-sensors/0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch
new file mode 100644
index 0000000..8fd942e
--- /dev/null
+++ b/recipes-phosphor/sensors/dbus-sensors/0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch
@@ -0,0 +1,63 @@
+From 5747098fd523b5bf170efef4cc9a18aab1a885d0 Mon Sep 17 00:00:00 2001
+From: Jinliang Wang <jinliangw@google.com>
+Date: Fri, 17 Nov 2023 22:33:20 -0800
+Subject: [PATCH] nvmesensor: add 1s timeout for mi worker condition_variable
+ wait
+
+The main thread `workerCv.notify_all` function may be called
+when mi worker thread is not in waiting state because there is a
+small time window between the `io.run()` and the
+`std::unique_lock<std::mutex> lock(mtx)` statement.
+
+This corner case is not a big issue after a subsystem is fully
+started because there is periodical health poll to notify the
+mi worker thread. However, for the `Status::Initializing` state,
+this corner case will cause mi worker thread never wake up to
+execute handler.
+
+There exist many solutions to fix this problem. The simplest one line
+fix should be adding 1 second timeout for the `cv.wait`.
+
+Tested:
+Before the fix, if we keep restarting nvmesensor for hundreds of loops,
+we will find at least one subsystem keeps on printing:
+
+ subsystem is intiatilzing, cancel the health poll
+ processing health data has been cancelled
+
+After the fix, the same restarting test passed hundreds of loops
+without printing the message above.
+
+Google-Bug-Id: b/310663382
+Change-Id: I5e1daa8b89f3dbd083a79b50c6c5093614791936
+Signed-off-by: Jinliang Wang <jinliangw@google.com>
+---
+ src/NVMeMi.cpp | 3 ++-
+ src/NVMeSubsys.cpp | 4 ++--
+ 2 files changed, 4 insertions(+), 3 deletions(-)
+
+diff --git a/src/NVMeMi.cpp b/src/NVMeMi.cpp
+index babcc3c..d50d041 100644
+--- a/src/NVMeMi.cpp
++++ b/src/NVMeMi.cpp
+@@ -10,6 +10,7 @@
+ #include <cerrno>
+ #include <fstream>
+ #include <iostream>
++#include <chrono>
+
+ std::map<int, std::weak_ptr<NVMeMi::Worker>> NVMeMi::workerMap{};
+
+@@ -205,7 +206,7 @@ NVMeMi::Worker::Worker()
+ io.restart();
+ {
+ std::unique_lock<std::mutex> lock(mtx);
+- cv.wait(lock);
++ cv.wait_for(lock, std::chrono::seconds(1));
+ if (stop)
+ {
+ // exhaust all tasks and exit
+
+--
+2.43.0.rc0.421.g78406f8d94-goog
+
diff --git a/recipes-phosphor/sensors/dbus-sensors_%.bbappend b/recipes-phosphor/sensors/dbus-sensors_%.bbappend
index 723f70b..ce0a468 100644
--- a/recipes-phosphor/sensors/dbus-sensors_%.bbappend
+++ b/recipes-phosphor/sensors/dbus-sensors_%.bbappend
@@ -70,6 +70,7 @@
file://0051-nvmesensor-remove-duplicate-chassis-prefix-in-sensor.patch \
file://0052-nvmesensor-add-Admin-non-data-command-passthru-DBus-.patch \
file://0053-nvmesensor-Update-MTU-and-frequency-for-MI-ports.patch \
+ file://0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch \
"
PACKAGECONFIG[nvmesensor] = "-Dnvme=enabled, -Dnvme=disabled, libnvme"
SYSTEMD_SERVICE:${PN} += "${@bb.utils.contains('PACKAGECONFIG', 'nvmesensor', \