go/tlbmc-sensor-debug
By default, creation of the tlBMC store fails if any sensor fails to be created. By contrast, with traditional dbus-sensors daemons, sensor creation failures are silent: the affected objects are simply not created. That results in partial data being reported and hides missing sensors in the Redfish tree unless a client queries them directly. tlBMC's behavior makes it possible to verify that all TlbmcOwned sensors declared in Entity Manager (EM) config files are successfully created and served by tlBMC.
Note that even when tlBMC is disabled, readings may still be missing for any sensor that has SkipDbusRead configured in its EM config file. This guide describes common methods for debugging which sensors fail to be created in this situation.
To verify that tlBMC failed while creating a sensor, check the bmcweb logs. Use the following command and expect similar output:
root@HOSTNAME:~# journalctl -u bmcweb | grep -i tlbmc
...
Jul 06 09:09:54 {HOSTNAME} bmcweb[1955026]: E0706 09:09:54.072554 1955026 webserver_main_setup.hpp:232] Cannot create tlBMC store!! Error: INTERNAL: Failed to find hwmon under /sys/bus/i2c/devices/i2c-1/1-001a - Disabling tlBMC
This log indicates that tlBMC failed to create a sensor object, most likely because the sensor failed to initialize, which may in turn indicate a real hardware failure.
To find out exactly which sensor is failing, use the bus and address from the sysfs path in the log, which has the form /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}. In the example above, the path /sys/bus/i2c/devices/i2c-1/1-001a corresponds to bus 1 and address 0x1a. Check the EM config files for this platform to find the sensor configuration with that bus and address. Verifying that the bus/address is correct for the intended sensor is an important first debugging step.
Next, check that the expected hwmon directory is present on the machine, since sensor readings are read from files beneath it. Use the following command; you should see similar output:
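As a rough sketch of this step, the shell function below pulls the bus and address out of a log line or sysfs path. The name parse_i2c_path is hypothetical, and the pattern assumes the {BUS}-00{ADDRESS} device naming shown in the log.

```shell
# Hypothetical helper: extract the I2C bus and address from a log line or
# sysfs path of the form .../i2c-{BUS}/{BUS}-00{ADDRESS}.
parse_i2c_path() {
  printf '%s\n' "$1" |
    sed -n 's|.*/i2c-\([0-9][0-9]*\)/[0-9][0-9]*-00\([0-9a-fA-F][0-9a-fA-F]*\).*|bus=\1 addr=0x\2|p'
}

# Example using the path from the log above.
parse_i2c_path "/sys/bus/i2c/devices/i2c-1/1-001a"   # prints: bus=1 addr=0x1a
```

The resulting bus and address can then be searched for in the platform's EM config files.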
root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
driver hwmon modalias name of_node pec subsystem uevent
Failure to find the hwmon directory shown above could indicate a larger issue, such as a real hardware failure (see below).
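When several devices need checking, the same test can be scripted. The following is a minimal sketch, assuming the device directory layout shown above; the function name check_hwmon and its return codes are illustrative only.

```shell
# Hypothetical helper: report whether an I2C device directory has a hwmon
# subdirectory. $1 is the device path, e.g.
# /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
check_hwmon() {
  dev_dir=$1
  if [ ! -d "$dev_dir" ]; then
    echo "device directory missing: $dev_dir"
    return 2
  fi
  # The hwmon index (hwmon0, hwmon1, ...) is not fixed, so glob for it.
  for hwmon_dir in "$dev_dir"/hwmon/hwmon*; do
    if [ -d "$hwmon_dir" ]; then
      echo "hwmon present: $hwmon_dir"
      return 0
    fi
  done
  echo "no hwmon under: $dev_dir"
  return 1
}
```

A return code of 1 corresponds to the "Failed to find hwmon" error in the bmcweb log, while 2 means the device itself was never instantiated at that bus/address.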
If the hwmon directory is present, list its contents to locate the file that holds the sensor's reading:
root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/
curr1_crit curr3_input in1_label in3_lcrit_alarm power2_label temp2_crit_alarm
curr1_crit_alarm curr3_label in1_lcrit in4_crit power3_input temp2_input
curr1_input curr3_max in1_lcrit_alarm in4_crit_alarm power3_label temp2_lcrit
curr1_label curr3_max_alarm in1_max in4_input power4_input temp2_lcrit_alarm
curr1_max curr4_crit in1_max_alarm in4_label power4_label temp2_max
curr1_max_alarm curr4_crit_alarm in1_min in4_lcrit subsystem temp2_max_alarm
curr2_crit curr4_input in1_min_alarm in4_lcrit_alarm temp1_crit temp3_crit
curr2_crit_alarm curr4_label in2_input name temp1_crit_alarm temp3_crit_alarm
curr2_input curr4_max in2_label of_node temp1_input temp3_input
curr2_label curr4_max_alarm in3_crit power1_alarm temp1_lcrit temp3_lcrit
curr2_max device in3_crit_alarm power1_input temp1_lcrit_alarm temp3_lcrit_alarm
curr2_max_alarm in1_crit in3_input power1_label temp1_max temp3_max
curr3_crit in1_crit_alarm in3_label power2_alarm temp1_max_alarm temp3_max_alarm
curr3_crit_alarm in1_input in3_lcrit power2_input temp2_crit uevent
To find the file that corresponds to the sensor you are interested in, cat the {sensor_type}{*}_label files until you find one whose value matches the label in the EM config. For instance, if the sensor name in the EM config is vout1_Name, the value vout1 will be present in one of in1_label, in2_label, in3_label, or in4_label. The sensor reading will be in the corresponding in{*}_input file. Verify that this file can be read and contains a well-formed numeric value.
The method described above has the disadvantage of only being able to diagnose a single sensor at a time. tlBMC also provides an option to bypass sensor creation failures while still reporting useful debug information: the allow_sensor_creation_failure setting in the tlBMC central configuration.
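The label lookup described above can be sketched as a small shell function. The name find_sensor_input is hypothetical; it assumes each {sensor_type}{*}_label file sits next to its {sensor_type}{*}_input file, as in the listing above.

```shell
# Hypothetical helper: find the *_input file whose *_label matches a label.
# $1 is the hwmon directory, $2 is the label to look for (e.g. vout1).
find_sensor_input() {
  hwmon_dir=$1
  wanted=$2
  for label_file in "$hwmon_dir"/*_label; do
    [ -f "$label_file" ] || continue
    if [ "$(cat "$label_file")" = "$wanted" ]; then
      # in2_label -> in2_input, temp1_label -> temp1_input, and so on.
      input_file="${label_file%_label}_input"
      if value=$(cat "$input_file" 2>/dev/null); then
        echo "$input_file: $value"
      else
        echo "$input_file: read failed"
      fi
      return 0
    fi
  done
  echo "no label matching '$wanted' in $hwmon_dir"
  return 1
}
```

For example, find_sensor_input /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*} vout1 would print the matching input file and its current value, or report that the read failed.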
This setting allows tlBMC store to be created regardless of sensor creation failures. Valid sensors will still be served by tlBMC and debug information can be obtained from tlBMC debug paths such as:
root@HOSTNAME:~# curl localhost/redfish/tlbmc/AllSensors
{
...
"error": {
"@Message.ExtendedInfo": [
{
"@odata.type": "#Message.v1_1_1.Message",
"Message": "Sensor temperature_{SENSOR_NAME} is not ready in tlBMC Store: Failed to read from input device: No such device or address; input device path: /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/temp{*}_input",
"MessageId": "Base.1.13.0.InternalError"
},
{
"@odata.type": "#Message.v1_1_1.Message",
"Message": "Sensor voltage_{SENSOR_NAME} is not ready in tlBMC Store: Read data can't be converted to a number: Invalid argument",
"MessageId": "Base.1.13.0.InternalError"
},
...
],
"code": "Base.1.8.GeneralError",
"message": "A general error has occurred. See Resolution for information on how to resolve the error."
}
}
All sensor creation errors encountered during tlBMC store creation are combined in the AllSensors response following the Redfish error message spec.
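To skim only the per-sensor errors from a long response, the Message strings can be filtered out with standard tools. This is a minimal sketch: the function name list_sensor_errors is hypothetical, and the pattern assumes the messages contain no escaped double quotes.

```shell
# Hypothetical helper: print each "Message" string from an AllSensors
# response read on stdin, one per line.
# Intended use: curl -s localhost/redfish/tlbmc/AllSensors | list_sensor_errors
list_sensor_errors() {
  # Matches "Message": "..." only; "MessageId" and the lowercase "message"
  # field do not match this pattern.
  grep -o '"Message": *"[^"]*"' | sed -e 's/^"Message": *"//' -e 's/"$//'
}
```

Each output line then names one failing sensor together with the reason tlBMC recorded for it.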
To enable the allow_sensor_creation_failure feature, make a change similar to: https://gbmc-private-review.git.corp.google.com/c/meta-google-private/+/35823. Add or modify the entry corresponding to the desired platform to include:
sensor_collector_module { enabled: true allow_sensor_creation_failure: true }
Build and flash a bmcweb binary including this change to have the central config take effect.
For additional information when diagnosing sensor failures, it may be helpful to check the kernel logs using dmesg. Use the following command and look for lines similar to:
root@HOSTNAME:~# dmesg | grep "Failed to register"
[ 92.520183] i2c i2c-{BUS}: Failed to register i2c client {DRIVER} at 0x{ADDRESS} (-16)
This indicates a failure to set up the device. Error code -16 is EBUSY (device or resource busy), so this could be a real failure, or the device could have been occupied by another script or service during boot.
Potential short-term solutions in this case are:
Manually bind the device using:
root@HOSTNAME:~# echo "{BUS}-00{ADDRESS}" > /sys/bus/i2c/drivers/{DRIVER}/bind
If the device was temporarily occupied during boot, this may correctly set up the device.
Power cycle the machine. Rebooting has fixed sensor instantiation in some cases.
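When there are many such lines, the relevant fields can be pulled out for comparison against the EM config. The sketch below assumes the exact message format shown above; the function name parse_register_failure and the sample values (bus 4, tmp75, 0x48) are made up for illustration.

```shell
# Hypothetical helper: extract bus, driver, address and error code from a
# "Failed to register i2c client" dmesg line.
parse_register_failure() {
  printf '%s\n' "$1" |
    sed -n 's/.*i2c i2c-\([0-9][0-9]*\): Failed to register i2c client \([^ ]*\) at 0x\([0-9a-fA-F]*\) (\(-[0-9]*\)).*/bus=\1 driver=\2 addr=0x\3 errno=\4/p'
}

# Example with made-up values in the format shown above.
parse_register_failure "[   92.520183] i2c i2c-4: Failed to register i2c client tmp75 at 0x48 (-16)"
# prints: bus=4 driver=tmp75 addr=0x48 errno=-16
```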
If either approach above is used, a bug should still be filed and the issue should be reproduced. This flakiness in sensor creation could mask underlying problems, e.g. b/428930642.
Note: In some cases, some Failed to register i2c client logs are expected, for instance when the EM config includes sensors for second-source boards. These sensors are expected to fail to create if the FRU on the machine is not the second-source FRU. Also note that such expected dmesg errors are only possible when these sensors are not supported by tlBMC. If tlBMC were to support the second-source board sensors, a separate config would have to be made to logically separate these sensors and probe accordingly.
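The manual bind step can be wrapped in a small function that builds the device name from the bus and address. This is a sketch only: the function name bind_i2c_device and its optional fourth argument (an alternate sysfs root for dry runs) are hypothetical.

```shell
# Hypothetical helper: bind an I2C device to its driver.
# $1 bus number, $2 address (e.g. 0x1a), $3 driver name,
# $4 optional sysfs root (defaults to /sys; useful for dry runs).
bind_i2c_device() {
  bus=$1
  addr=$2
  driver=$3
  sysfs_root=${4:-/sys}
  # Device names have the form {BUS}-00{ADDRESS}, e.g. 1-001a.
  dev_id=$(printf '%d-%04x' "$bus" "$addr")
  echo "$dev_id" > "$sysfs_root/bus/i2c/drivers/$driver/bind"
}
```

For example, bind_i2c_device {BUS} 0x{ADDRESS} {DRIVER} writes the same string as the echo command above.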