go/tlbmc-sensor-debug
By default, creation of the tlBMC store fails if any sensor fails to be created. In contrast, sensor creation failures in traditional dbus-sensors daemons are silent: the affected objects are simply not created. This results in partial data being reported and hides missing sensors in the Redfish tree unless a client queries them directly. tlBMC's stricter behavior makes it possible to verify that all TlbmcOwned sensors declared in Entity Manager (EM) config files are successfully created and served by tlBMC.
You can verify that tlBMC has failed to initialize by querying the tlBMC root, which may result in the following output:
root@HOSTNAME:~# curl localhost/redfish/tlbmc
{
  "error": {
    "@Message.ExtendedInfo": [
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "The requested resource of type named 'tlbmc' was not found.",
        "MessageArgs": [
          "",
          "tlbmc"
        ],
        "MessageId": "Base.1.13.0.ResourceNotFound",
        "MessageSeverity": "Critical",
        "Resolution": "Provide a valid resource identifier and resubmit the request."
      }
    ],
    "code": "Base.1.13.0.ResourceNotFound",
    "message": "The requested resource of type named 'tlbmc' was not found."
  }
}
This output indicates that tlBMC is currently disabled and that requests fall back to traditional bmcweb. Note that sensor readings may be missing in this case if SkipDbusRead is configured per sensor in the EM config files. This guide describes some common methods for debugging which sensors failed to be created in this situation.
To verify that tlBMC has failed during creation of a sensor, we can check the bmcweb logs. To check logs, use the following command and expect similar output:
root@HOSTNAME:~# journalctl -u bmcweb | grep -i tlbmc
...
Jul 06 09:09:54 {HOSTNAME} bmcweb[1955026]: E0706 09:09:54.072554 1955026 webserver_main_setup.hpp:232] Cannot create tlBMC store!! Error: INTERNAL: Failed to find hwmon under /sys/bus/i2c/devices/i2c-1/1-001a - Disabling tlBMC
This log indicates that tlBMC failed to create a sensor object, most likely because the sensor failed to initialize, which may in turn indicate a real hardware failure.
To find additional information about the exact sensor that is failing, use the bus and address from the path in the log, which has the form /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}. In the example log above, the bus is 1 and the address is 1a. By checking the EM config files for this platform, you can find the sensor configuration associated with that bus and address. Verifying that the bus and address are correct for the intended sensor is an important first debugging step.
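The bus and address can also be extracted from the log line mechanically. The following is a minimal sketch, assuming the log contains a path of the form described above; parse_hwmon_failure is a hypothetical helper name and is not part of tlBMC or bmcweb.

```shell
# parse_hwmon_failure: hypothetical helper, not part of tlBMC or bmcweb.
# Reads log lines on stdin and prints "<bus> <address>" for every line that
# contains a path of the form /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}.
# Lines without such a path produce no output.
parse_hwmon_failure() {
  sed -n 's|.*/sys/bus/i2c/devices/i2c-\([0-9][0-9]*\)/[0-9][0-9]*-00\([0-9a-fA-F][0-9a-fA-F]\).*|\1 \2|p'
}
```

It could be used as journalctl -u bmcweb | grep -i tlbmc | parse_hwmon_failure, and the printed bus and address can then be looked up in the EM config files.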
Checking that the expected hwmon file is present on the machine is necessary to verify that sensor readings are working as intended. To do so, use the following commands and you should expect to see similar output:
root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
driver  hwmon  modalias  name  of_node  pec  subsystem  uevent
Failure to find the hwmon directory shown above could indicate a larger issue, such as a real hardware failure (see below).
If the hwmon directory is present, you can verify the intended value of the sensor by checking the following:
root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/
curr1_crit        curr3_input       in1_label        in3_lcrit_alarm  power2_label       temp2_crit_alarm
curr1_crit_alarm  curr3_label       in1_lcrit        in4_crit         power3_input       temp2_input
curr1_input       curr3_max         in1_lcrit_alarm  in4_crit_alarm   power3_label       temp2_lcrit
curr1_label       curr3_max_alarm   in1_max          in4_input        power4_input       temp2_lcrit_alarm
curr1_max         curr4_crit        in1_max_alarm    in4_label        power4_label       temp2_max
curr1_max_alarm   curr4_crit_alarm  in1_min          in4_lcrit        subsystem          temp2_max_alarm
curr2_crit        curr4_input       in1_min_alarm    in4_lcrit_alarm  temp1_crit         temp3_crit
curr2_crit_alarm  curr4_label       in2_input        name             temp1_crit_alarm   temp3_crit_alarm
curr2_input       curr4_max         in2_label        of_node          temp1_input        temp3_input
curr2_label       curr4_max_alarm   in3_crit         power1_alarm     temp1_lcrit        temp3_lcrit
curr2_max         device            in3_crit_alarm   power1_input     temp1_lcrit_alarm  temp3_lcrit_alarm
curr2_max_alarm   in1_crit          in3_input        power1_label     temp1_max          temp3_max
curr3_crit        in1_crit_alarm    in3_label        power2_alarm     temp1_max_alarm    temp3_max_alarm
curr3_crit_alarm  in1_input         in3_lcrit        power2_input     temp2_crit         uevent
To find the file that corresponds to the sensor you are interested in, you can cat the {sensor_type}{*}_label files to find the one that matches the label in the EM config. For instance, if the sensor name in the EM config is vout1_Name, the value vout1 will be present in one of in1_label, in2_label, in3_label, or in4_label. The sensor reading will be in the corresponding in{*}_input file. Verify that this is a well-formed sensor reading value as expected.
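The label lookup described above can be scripted rather than done by hand. The following is a minimal sketch, assuming every {sensor_type}{*}_label file has a sibling {sensor_type}{*}_input file in the same hwmon directory; find_input_by_label is a hypothetical helper name and is not provided by tlBMC.

```shell
# find_input_by_label: hypothetical helper, not provided by tlBMC.
# Usage: find_input_by_label <hwmon_dir> <label>
# Prints "<input_file> <raw_value>" for each *_label file in <hwmon_dir>
# whose content equals <label>.
find_input_by_label() {
  dir=$1
  want=$2
  for label_file in "$dir"/*_label; do
    # Skip the unexpanded glob when the directory has no label files.
    [ -r "$label_file" ] || continue
    [ "$(cat "$label_file")" = "$want" ] || continue
    # in2_label -> in2_input, temp1_label -> temp1_input, and so on.
    input_file="${label_file%_label}_input"
    if value=$(cat "$input_file" 2>/dev/null); then
      echo "$input_file $value"
    else
      # The read itself failed, which is a useful debugging signal.
      echo "$input_file <read failed>"
    fi
  done
}
```

For the example above, the helper would be called with the hwmon{*} directory of the device as the first argument and vout1 as the second. A "<read failed>" result points at the same kind of device read error that tlBMC would hit.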
The method described above has the disadvantage of only being able to diagnose a single sensor at a time. tlBMC can instead be configured to bypass sensor creation failures while still providing useful information for debugging. The tlBMC central configuration provides a setting for this: allow_sensor_creation_failure.
This setting allows tlBMC store to be created regardless of sensor creation failures. Valid sensors will still be served by tlBMC and debug information can be obtained from tlBMC debug paths such as:
root@HOSTNAME:~# curl localhost/redfish/tlbmc/AllSensors
{
  ...
  "error": {
    "@Message.ExtendedInfo": [
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "Sensor temperature_{SENSOR_NAME} is not ready in tlBMC Store: Failed to read from input device: No such device or address; input device path: /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/temp{*}_input",
        "MessageId": "Base.1.13.0.InternalError"
      },
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "Sensor voltage_{SENSOR_NAME} is not ready in tlBMC Store: Read data can't be converted to a number: Invalid argument",
        "MessageId": "Base.1.13.0.InternalError"
      },
      ...
    ],
    "code": "Base.1.8.GeneralError",
    "message": "A general error has occurred. See Resolution for information on how to resolve the error."
  }
}
All sensor creation errors encountered during tlBMC store creation are combined in the AllSensors response following the Redfish error message spec.
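When many sensors fail, it can help to reduce the AllSensors response to one failure message per line. The following is a minimal sketch that uses only grep and sed, assuming the "Message" strings contain no escaped double quotes and contain the phrase "is not ready in tlBMC Store" as in the output above; list_sensor_failures is a hypothetical helper name, not a tlBMC tool.

```shell
# list_sensor_failures: hypothetical helper, not a tlBMC tool.
# Reads an AllSensors response on stdin and prints one sensor failure
# message per line. Assumes messages contain no escaped double quotes.
list_sensor_failures() {
  grep -o '"Message": *"[^"]*is not ready in tlBMC Store[^"]*"' |
    sed 's/^"Message": *"//; s/"$//'
}
```

It could be used as curl localhost/redfish/tlbmc/AllSensors | list_sensor_failures. A JSON-aware tool would be more robust if one is available on the machine.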
To enable the allow_sensor_creation_failure feature, make a change similar to https://gbmc-private-review.git.corp.google.com/c/meta-google-private/+/35823. Add or modify the entry for the desired platform to include:
sensor_collector_module {
  enabled: true
  allow_sensor_creation_failure: true
}
Build and flash a bmcweb binary including this change to have the central config take effect.
For additional information when diagnosing sensor failures, it may be helpful to check the kernel logs using dmesg. Run the following command and look for logs similar to these:
root@HOSTNAME:~# dmesg | grep "Failed to register"
[ 92.520183] i2c i2c-{BUS}: Failed to register i2c client {DRIVER} at 0x{ADDRESS} (-16)
This indicates a failure to set up the device. Error -16 is EBUSY, so this could be a real failure, or the device could have been occupied by another script or service during boot.
Potential short-term solutions in this case include:
1. Manually bind the device using:
   root@HOSTNAME:~# echo "{BUS}-00{ADDRESS}" > /sys/bus/i2c/drivers/{DRIVER}/bind
   If the device was only temporarily occupied during boot, this may set up the device correctly.
2. Power cycle the machine. Rebooting has fixed sensor instantiation in some cases.
If either approach above is used, a bug should still be filed and the issue should be reproduced. Flakiness in sensor creation could mask underlying problems, e.g. b/428930642.
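To avoid transcription mistakes when binding manually, the bind command can be derived from the dmesg lines. The following is a minimal sketch that only prints candidate commands and never writes to sysfs. It assumes the name reported in the dmesg line matches the directory name under /sys/bus/i2c/drivers, as the {DRIVER} placeholder in this guide suggests; verify this on the machine before running the printed command. print_bind_commands is a hypothetical helper name.

```shell
# print_bind_commands: hypothetical helper; prints commands, does not run them.
# Reads dmesg lines on stdin. For each "Failed to register i2c client" line,
# prints the manual bind command built from the bus, name, and address.
# Assumption: the name in the dmesg line matches the driver directory name.
print_bind_commands() {
  sed -n 's|.*i2c-\([0-9][0-9]*\): Failed to register i2c client \([^ ]*\) at 0x\([0-9a-fA-F]*\) (-[0-9][0-9]*).*|echo "\1-00\3" > /sys/bus/i2c/drivers/\2/bind|p'
}
```

It could be used as dmesg | print_bind_commands. Printing rather than executing keeps the decision to bind with the person debugging, which matters because some of these failures are expected.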
Note: In some cases, some Failed to register i2c client logs are expected, for instance when the EM config includes sensors for second-source boards. These sensors may be expected to fail to be created if the FRU on the machine does not correspond with the second-source FRU. Also note that such expected dmesg error logs are only possible when these sensors are not supported by tlBMC. If tlBMC were to support the second-source board sensors, a separate config would have to be made to logically separate these sensors and probe them accordingly.