dbus-sensors: nvmesensor: add 1s timeout for mi worker cv.wait
The main thread `workerCv.notify_all` function may be called
when mi worker thread is not in waiting state because there is a
small time window between the `io.run()` and the
`std::unique_lock<std::mutex> lock(mtx)` statement.
This corner case is not a big issue after a subsystem is fully
started because there is periodical health poll to notify the
mi worker thread. However, for the `Status::Initializing` state,
this corner case will cause mi worker thread never wake up to
execute handler.
There exist many solutions to fix this problem. The simplest (one line)
fix should be adding 1 second timeout for the `cv.wait`.
Tested:
Reproduce script: https://paste.googleplex.com/5229001098919936
Before the fix, if we keep restarting nvmesensor for hundreds of loops,
we will find that at least one subsystem keeps on printing:
subsystem is intiatilzing, cancel the health poll
processing health data has been cancelled
After the fix, the same restarting test passed hundreds of loops without
printing the message above.
Google-Bug-Id: 310663382
Change-Id: I6493e87d1ee555a8fba581470da503b0c67bec27
Signed-off-by: Jinliang Wang <jinliangw@google.com>
(cherry picked from commit d65f03ce6549a6be9cb0f373460a14bcca689fed)
diff --git a/recipes-phosphor/sensors/dbus-sensors/0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch b/recipes-phosphor/sensors/dbus-sensors/0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch
new file mode 100644
index 0000000..6ab27c7
--- /dev/null
+++ b/recipes-phosphor/sensors/dbus-sensors/0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch
@@ -0,0 +1,84 @@
+From 5747098fd523b5bf170efef4cc9a18aab1a885d0 Mon Sep 17 00:00:00 2001
+From: Jinliang Wang <jinliangw@google.com>
+Date: Fri, 17 Nov 2023 22:33:20 -0800
+Subject: [PATCH] nvmesensor: add 1s timeout for mi worker condition_variable
+ wait
+
+The main thread `workerCv.notify_all` function may be called
+when mi worker thread is not in waiting state because there is a
+small time window between the `io.run()` and the
+`std::unique_lock<std::mutex> lock(mtx)` statement.
+
+This corner case is not a big issue after a subsystem is fully
+started because there is periodical health poll to notify the
+mi worker thread. However, for the `Status::Initializing` state,
+this corner case will cause mi worker thread never wake up to
+execute handler.
+
+There exist many solutions to fix this problem. The simplest one line
+fix should be adding 1 second timeout for the `cv.wait`.
+
+Tested:
+Before the fix, if we keep restarting nvmesensor for hundreds of loops,
+we will find at least one subsystem keeps on printing:
+
+ subsystem is intiatilzing, cancel the health poll
+ processing health data has been cancelled
+
+After the fix, the same restarting test passed hundreds of loops
+without printing the message above.
+
+Google-Bug-Id: b/310663382
+Change-Id: I5e1daa8b89f3dbd083a79b50c6c5093614791936
+Signed-off-by: Jinliang Wang <jinliangw@google.com>
+---
+ src/NVMeMi.cpp | 3 ++-
+ src/NVMeSubsys.cpp | 4 ++--
+ 2 files changed, 4 insertions(+), 3 deletions(-)
+
+diff --git a/src/NVMeMi.cpp b/src/NVMeMi.cpp
+index babcc3c..d50d041 100644
+--- a/src/NVMeMi.cpp
++++ b/src/NVMeMi.cpp
+@@ -10,6 +10,7 @@
+ #include <cerrno>
+ #include <fstream>
+ #include <iostream>
++#include <chrono>
+
+ std::map<int, std::weak_ptr<NVMeMi::Worker>> NVMeMi::workerMap{};
+
+@@ -205,7 +206,7 @@ NVMeMi::Worker::Worker()
+ io.restart();
+ {
+ std::unique_lock<std::mutex> lock(mtx);
+- cv.wait(lock);
++ cv.wait_for(lock, std::chrono::seconds(1));
+ if (stop)
+ {
+ // exhaust all tasks and exit
+diff --git a/src/NVMeSubsys.cpp b/src/NVMeSubsys.cpp
+index 1b889ea..96339cd 100644
+--- a/src/NVMeSubsys.cpp
++++ b/src/NVMeSubsys.cpp
+@@ -1112,7 +1112,7 @@ void NVMeSubsystem::attachCtrlVolume(uint16_t c, uint32_t ns)
+ return;
+ }
+ attached[c].insert(ns);
+- std::cout << name << " attached insert " << c << " " << ns << "\n";
++ std::cerr << name << " attached insert " << c << " " << ns << std::endl;
+ controllers[c].first->updateAssociation();
+ }
+
+@@ -1129,7 +1129,7 @@ void NVMeSubsystem::detachCtrlVolume(uint16_t c, uint32_t ns)
+ return;
+ }
+ attached[c].erase(ns);
+- std::cout << name << " attached erase " << c << " " << ns << "\n";
++ std::cerr << name << " attached erase " << c << " " << ns << std::endl;
+ controllers[c].first->updateAssociation();
+ }
+
+--
+2.43.0.rc0.421.g78406f8d94-goog
+
diff --git a/recipes-phosphor/sensors/dbus-sensors_%.bbappend b/recipes-phosphor/sensors/dbus-sensors_%.bbappend
index 0b78e04..eaa07d4 100644
--- a/recipes-phosphor/sensors/dbus-sensors_%.bbappend
+++ b/recipes-phosphor/sensors/dbus-sensors_%.bbappend
@@ -91,6 +91,7 @@
file://0020-nvmesensor-add-primary-info-for-plugins.patch \
file://0021-nvmesensor-use-processSecondaryControllerList-to-rep.patch \
file://0022-nvmesensor-refactor-createVolume-and-createNamespace.patch \
+ file://0023-nvmesensor-add-1s-timeout-for-mi-worker-condition_va.patch \
"
PACKAGECONFIG[nvmesensor] = "-Dnvme=enabled, -Dnvme=disabled, libnvme"
SYSTEMD_SERVICE:${PN} += "${@bb.utils.contains('PACKAGECONFIG', 'nvmesensor', \