# Sensor Debug Guide
go/tlbmc-sensor-debug
<!--*
# Document freshness: For more information, see go/fresh-source.
freshness: { owner: 'tlbmc-dev' reviewed: '2025-07-08' }
*-->
By default, creation of the tlBMC store fails if any sensor fails to be
created. By contrast, with traditional dbus-sensors daemons, sensor creation
failures are silent: the affected objects are simply not created. This results
in partial data being reported and hides missing sensors in the Redfish tree
unless a client queries them directly. tlBMC's behavior makes it possible to
verify that all TlbmcOwned sensors declared in Entity Manager (EM) config files
are successfully created and served by tlBMC.
Note that when tlBMC is disabled, sensor readings may still be missing if
`SkipDbusRead` is configured for the sensor in the EM config files. This guide
describes common methods for finding which sensors failed to be created in this
situation.
[TOC]
## Check tlBMC Store Creation Failure Logs
To confirm that tlBMC failed while creating a sensor, check the bmcweb logs.
Run the following command and look for output similar to:
```
root@HOSTNAME:~# journalctl -u bmcweb | grep -i tlbmc
...
Jul 06 09:09:54 {HOSTNAME} bmcweb[1955026]: E0706 09:09:54.072554 1955026 webserver_main_setup.hpp:232] Cannot create tlBMC store!! Error: INTERNAL: Failed to find hwmon under /sys/bus/i2c/devices/i2c-1/1-001a - Disabling tlBMC
```
This log indicates that tlBMC failed to create a sensor object, most likely
because the sensor failed to initialize, which may point to a real hardware
failure.
To identify the exact sensor that is failing, take the bus and address from the
path in the log, which has the form
`/sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}`. In the example above, the
bus is `1` and the address is `1a`. Then look in the EM config files for this
platform to find the sensor configuration with that bus/address. Verifying that
the bus/address is correct for the intended sensor is an important first step
in debugging.
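
As a rough sketch of the step above, the bus and address can be split out of the logged path with plain shell parameter expansion. The helper names `parse_i2c_path` and `grep_em_address` are hypothetical, and the search assumes the EM config writes the address in hexadecimal with a `0x` prefix, which may not hold for every platform.

```shell
# Hypothetical helper: split a sysfs I2C device path into bus and address.
# Example input: /sys/bus/i2c/devices/i2c-1/1-001a
parse_i2c_path() {
  dev="${1##*/}"      # last path component, e.g. "1-001a"
  bus="${dev%%-*}"    # text before the first "-", e.g. "1"
  addr="${dev#*-}"    # text after the first "-", e.g. "001a"
  addr="${addr#00}"   # drop the leading "00" padding, e.g. "1a"
  echo "$bus $addr"
}

# Hypothetical helper: list EM config lines that mention an address.
# $1: address as printed by parse_i2c_path
# $2: directory holding the EM config files (location is platform-specific)
# Assumption: the config spells the address as 0x<hex>; matching ignores case.
grep_em_address() {
  grep -rni "0x$1" "$2"
}

parse_i2c_path /sys/bus/i2c/devices/i2c-1/1-001a   # prints "1 1a"
```

The matches from `grep_em_address` still need to be checked by hand against the bus number, since the same address can appear on several buses.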
## Verify hwmon File Presence and Value
To verify that sensor readings work as intended, first check that the expected
hwmon directory is present on the machine. Run the following command and look
for output similar to:
```
root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
driver hwmon modalias name of_node pec subsystem uevent
```
If the hwmon directory shown above is missing, there may be a larger issue,
such as a real hardware failure (see Verify Real Hardware Failures below).
If the hwmon directory is present, list the sensor files it exposes:
```
root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/
curr1_crit curr3_input in1_label in3_lcrit_alarm power2_label temp2_crit_alarm
curr1_crit_alarm curr3_label in1_lcrit in4_crit power3_input temp2_input
curr1_input curr3_max in1_lcrit_alarm in4_crit_alarm power3_label temp2_lcrit
curr1_label curr3_max_alarm in1_max in4_input power4_input temp2_lcrit_alarm
curr1_max curr4_crit in1_max_alarm in4_label power4_label temp2_max
curr1_max_alarm curr4_crit_alarm in1_min in4_lcrit subsystem temp2_max_alarm
curr2_crit curr4_input in1_min_alarm in4_lcrit_alarm temp1_crit temp3_crit
curr2_crit_alarm curr4_label in2_input name temp1_crit_alarm temp3_crit_alarm
curr2_input curr4_max in2_label of_node temp1_input temp3_input
curr2_label curr4_max_alarm in3_crit power1_alarm temp1_lcrit temp3_lcrit
curr2_max device in3_crit_alarm power1_input temp1_lcrit_alarm temp3_lcrit_alarm
curr2_max_alarm in1_crit in3_input power1_label temp1_max temp3_max
curr3_crit in1_crit_alarm in3_label power2_alarm temp1_max_alarm temp3_max_alarm
curr3_crit_alarm in1_input in3_lcrit power2_input temp2_crit uevent
```
To find the file that corresponds to the sensor you are interested in, `cat`
the `{sensor_type}{*}_label` files until you find one that matches the label in
the EM config. For instance, if the sensor is named in the EM config through
`vout1_Name`, the value `vout1` will be present in one of `in1_label`,
`in2_label`, `in3_label`, or `in4_label`. The sensor reading will be in the
corresponding `in{*}_input` file. Verify that this file can be read and that it
contains a well-formed numeric value.
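
Checking labels one at a time is tedious, so a loop along these lines may help. It is only a sketch: `dump_hwmon_labels` is a hypothetical helper name, and it prints every label next to the raw contents of the matching `_input` file.

```shell
# Print "<label file>: <label> = <raw input value>" for every labelled channel.
# $1: hwmon directory, e.g.
#     /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}
dump_hwmon_labels() {
  for label_file in "$1"/*_label; do
    [ -e "$label_file" ] || continue          # glob matched nothing
    input_file="${label_file%_label}_input"   # in1_label -> in1_input
    label="$(cat "$label_file")"
    # Report a failed or missing read explicitly instead of printing nothing.
    value="$(cat "$input_file" 2>/dev/null)" || value="READ FAILED"
    echo "${label_file##*/}: $label = $value"
  done
}
```

Values are printed exactly as the kernel exposes them. The hwmon sysfs interface reports voltages in millivolts and temperatures in millidegrees Celsius, so a reading such as `12000` in an `in{*}_input` file corresponds to 12 V.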
## Enable allow_sensor_creation_failure in tlBMC Configuration
The method described above can only diagnose a single sensor at a time. To
bypass sensor creation failures while still getting useful debug information,
tlBMC provides the `allow_sensor_creation_failure` setting in its central
configuration.
This setting allows tlBMC store to be created regardless of sensor creation
failures. Valid sensors will still be served by tlBMC and debug information can
be obtained from tlBMC debug paths such as:
```
root@HOSTNAME:~# curl localhost/redfish/tlbmc/AllSensors
{
...
"error": {
"@Message.ExtendedInfo": [
{
"@odata.type": "#Message.v1_1_1.Message",
"Message": "Sensor temperature_{SENSOR_NAME} is not ready in tlBMC Store: Failed to read from input device: No such device or address; input device path: /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/temp{*}_input",
"MessageId": "Base.1.13.0.InternalError"
},
{
"@odata.type": "#Message.v1_1_1.Message",
"Message": "Sensor voltage_{SENSOR_NAME} is not ready in tlBMC Store: Read data can't be converted to a number: Invalid argument",
"MessageId": "Base.1.13.0.InternalError"
},
...
],
"code": "Base.1.8.GeneralError",
"message": "A general error has occurred. See Resolution for information on how to resolve the error."
}
}
```
All sensor creation errors encountered during tlBMC store creation are combined
in the AllSensors response following the
[Redfish error message spec](https://redfish.dmtf.org/schemas/DSP0266_1.19.0.html#error-responses).
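
When many sensors fail, it can be easier to read only the error messages. The filter below is a minimal sketch that uses `grep` and `sed` rather than a JSON parser, on the assumption that a JSON tool may not be installed on the BMC. `list_sensor_errors` is a hypothetical helper name, and the pattern assumes that messages contain no escaped double quotes.

```shell
# Read a Redfish error response on stdin and print one "Message" per line.
# Only the exact key "Message" is matched, so "MessageId" and the top-level
# lowercase "message" are skipped.
list_sensor_errors() {
  grep -o '"Message": *"[^"]*"' | sed 's/^"Message": *"//; s/"$//'
}

# On the BMC:
#   curl -s localhost/redfish/tlbmc/AllSensors | list_sensor_errors
```

Each output line then names one failing sensor together with the reason reported by tlBMC.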
To enable the `allow_sensor_creation_failure` feature, a change must be made
similar to:
https://gbmc-private-review.git.corp.google.com/c/meta-google-private/+/35823.
Add/modify the entry corresponding with the desired platform to include:
```
sensor_collector_module { enabled: true allow_sensor_creation_failure: true }
```
Build and flash a bmcweb binary including this change to have the central config
take effect.
## Verify Real Hardware Failures
For additional information to diagnose sensor failures, it may be helpful to
check the kernel logs using `dmesg`. Run the following command and look for
logs similar to:
```
root@HOSTNAME:~# dmesg | grep "Failed to register"
[ 92.520183] i2c i2c-{BUS}: Failed to register i2c client {DRIVER} at 0x{ADDRESS} (-16)
```
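This indicates a failure to set up the device. It could be a real hardware
failure, or the device could have been occupied by another script or service
during boot. The `-16` in the example is the kernel error code `-EBUSY`
("Device or resource busy"), which is consistent with the second case.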
Potential *short term* solutions in this case could be to:
- Manually bind the device using:
  ```
  root@HOSTNAME:~# echo "{BUS}-00{ADDRESS}" > /sys/bus/i2c/drivers/{DRIVER}/bind
  ```
  If the device was temporarily occupied during boot, this may correctly set
  up the device.
- Power cycle the machine: rebooting has fixed sensor instantiation in some
  cases.

If either approach above is used, a bug should still be filed and the issue
should be reproduced, because flakiness in sensor creation can mask
underlying problems, e.g. b/428930642.
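
Before binding manually, it can be worth checking whether the device already has a driver attached, since the device directory contains a `driver` entry once it is bound. The following is a sketch only: `bind_if_unbound` is a hypothetical helper, and its third argument exists purely so the function can be tried against a scratch directory instead of the real sysfs tree.

```shell
# Bind an I2C device to a driver unless it is already bound.
# $1: device name in the form {BUS}-00{ADDRESS}, e.g. 1-001a
# $2: driver name ({DRIVER})
# $3: I2C sysfs root (optional, defaults to /sys/bus/i2c)
bind_if_unbound() {
  root="${3:-/sys/bus/i2c}"
  bus="${1%%-*}"                      # "1-001a" -> "1"
  if [ -e "$root/devices/i2c-$bus/$1/driver" ]; then
    echo "$1 is already bound to a driver"
    return 0
  fi
  # Same write as the manual bind command above.
  echo "$1" > "$root/drivers/$2/bind"
}
```

The function returns a non-zero status when the write to `bind` fails, so the result should be checked before assuming that the sensor is available.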
Note: Some `Failed to register i2c client` logs may be expected, for instance
when the EM config includes sensors for second-source boards. These sensors are
expected to fail to create if the FRU on the machine does not correspond with
the second-source FRU. Also note that such expected dmesg error logs are only
possible when these sensors are not supported by tlBMC. If tlBMC were to support
the second-source board sensors, a separate config would have to be made to
logically separate these sensors and probe accordingly.