Sensor Debug Guide

go/tlbmc-sensor-debug

By default, tlBMC store creation fails if any sensor fails to be created. Traditional dbus-sensors daemons behave differently: sensor creation failures are silent, and the only symptom is that the corresponding objects are never created. That leads to partial data being reported and hides missing sensors in the Redfish tree unless a client queries them directly. tlBMC's stricter behavior makes it possible to verify that all TlbmcOwned sensors declared in Entity Manager (EM) config files are successfully created and served by tlBMC.

You can verify that tlBMC has failed to initialize by querying the tlBMC root. When initialization has failed, the query returns output similar to the following:

root@HOSTNAME:~# curl localhost/redfish/tlbmc
{
  "error": {
    "@Message.ExtendedInfo": [
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "The requested resource of type  named 'tlbmc' was not found.",
        "MessageArgs": [
          "",
          "tlbmc"
        ],
        "MessageId": "Base.1.13.0.ResourceNotFound",
        "MessageSeverity": "Critical",
        "Resolution": "Provide a valid resource identifier and resubmit the request."
      }
    ],
    "code": "Base.1.13.0.ResourceNotFound",
    "message": "The requested resource of type  named 'tlbmc' was not found."
  }
}

This output indicates that tlBMC is currently disabled and that requests fall back to traditional bmcweb. Note that in this case sensor readings may be missing for any sensor that has SkipDbusRead set in its EM config. The rest of this guide describes common methods for finding which sensors failed to be created.
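For scripted checks, the response above can be tested for the ResourceNotFound code. The following is a minimal sketch, assuming the error body has the shape shown above; tlbmc_root_missing is a hypothetical helper name, not part of tlBMC or bmcweb.

```shell
# Hypothetical helper: reads a Redfish response on stdin and succeeds
# (exit status 0) when it carries a ResourceNotFound error code, as in
# the output above. Any other response yields a non-zero exit status.
tlbmc_root_missing() {
    grep -q '"code": *"Base\.[0-9.]*\.ResourceNotFound"'
}
```

One possible use, where -s only silences curl's progress output:

root@HOSTNAME:~# curl -s localhost/redfish/tlbmc | tlbmc_root_missing && echo "tlBMC is disabled"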

Check tlBMC Store Creation Failure Logs

To confirm that tlBMC failed while creating a sensor, check the bmcweb logs. Use the following command and expect output similar to:

root@HOSTNAME:~# journalctl -u bmcweb | grep -i tlbmc
...
Jul 06 09:09:54 {HOSTNAME} bmcweb[1955026]: E0706 09:09:54.072554 1955026 webserver_main_setup.hpp:232] Cannot create tlBMC store!! Error: INTERNAL: Failed to find hwmon under /sys/bus/i2c/devices/i2c-1/1-001a - Disabling tlBMC

This log shows that tlBMC could not create a sensor object, most likely because the sensor failed to initialize, which may point to a real hardware failure.

To identify the exact sensor that is failing, take the bus and address from the path in the log, which has the form /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}. In the log above, the bus is 1 and the address is 1a. Then check the EM config files for this platform to find the sensor configuration with that bus and address. Verifying that the bus and address are correct for the intended sensor is an important first debugging step.
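When several such log lines are present, the bus and address can be pulled out mechanically. The following is a sketch only, assuming the log path always has the i2c-{BUS}/{BUS}-00{ADDRESS} form described above; i2c_bus_addr_from_log is a hypothetical helper name.

```shell
# Hypothetical helper: reads log lines on stdin and, for each line that
# contains a path of the form
#   /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
# prints "{BUS} 0x{ADDRESS}". Lines without such a path print nothing.
i2c_bus_addr_from_log() {
    sed -n 's|.*/sys/bus/i2c/devices/i2c-\([0-9]*\)/[0-9]*-00\([0-9a-fA-F]*\).*|\1 0x\2|p'
}
```

For example:

root@HOSTNAME:~# journalctl -u bmcweb | grep -i tlbmc | i2c_bus_addr_from_log

For the log line shown earlier this prints "1 0x1a", which can then be searched for in the EM config files.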

Verify hwmon File Presence and Value

The expected hwmon files must be present on the machine for sensor readings to work. To check, list the device directory. You should see output similar to:

root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
driver     hwmon      modalias   name       of_node    pec        subsystem  uevent

If the hwmon directory shown above is missing, this could indicate a larger issue, such as a real hardware failure (see Verify Real Hardware Failures).

If the hwmon directory is present, list its contents to find the files that hold the sensor's values:

root@HOSTNAME:~# ls /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/
curr1_crit         curr3_input        in1_label          in3_lcrit_alarm    power2_label       temp2_crit_alarm
curr1_crit_alarm   curr3_label        in1_lcrit          in4_crit           power3_input       temp2_input
curr1_input        curr3_max          in1_lcrit_alarm    in4_crit_alarm     power3_label       temp2_lcrit
curr1_label        curr3_max_alarm    in1_max            in4_input          power4_input       temp2_lcrit_alarm
curr1_max          curr4_crit         in1_max_alarm      in4_label          power4_label       temp2_max
curr1_max_alarm    curr4_crit_alarm   in1_min            in4_lcrit          subsystem          temp2_max_alarm
curr2_crit         curr4_input        in1_min_alarm      in4_lcrit_alarm    temp1_crit         temp3_crit
curr2_crit_alarm   curr4_label        in2_input          name               temp1_crit_alarm   temp3_crit_alarm
curr2_input        curr4_max          in2_label          of_node            temp1_input        temp3_input
curr2_label        curr4_max_alarm    in3_crit           power1_alarm       temp1_lcrit        temp3_lcrit
curr2_max          device             in3_crit_alarm     power1_input       temp1_lcrit_alarm  temp3_lcrit_alarm
curr2_max_alarm    in1_crit           in3_input          power1_label       temp1_max          temp3_max
curr3_crit         in1_crit_alarm     in3_label          power2_alarm       temp1_max_alarm    temp3_max_alarm
curr3_crit_alarm   in1_input          in3_lcrit          power2_input       temp2_crit         uevent

To find the file for the sensor you are interested in, cat the {sensor_type}{*}_label files until you find one that matches the label in the EM config. For instance, if the sensor name in the EM config is vout1_Name, the value vout1 will appear in one of in1_label, in2_label, in3_label, or in4_label. The sensor reading is in the corresponding in{*}_input file. Verify that it contains a well-formed sensor reading.
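The label lookup described above can be sketched as a small loop. This assumes each {type}{N}_label file has a matching {type}{N}_input file in the same directory; find_hwmon_input_by_label is a hypothetical helper name, and READ_FAILED is a marker chosen for this example.

```shell
# Hypothetical helper: for every *_label file in a hwmon directory whose
# content equals the wanted label, print the matching *_input path and
# its reading, or READ_FAILED if the input file cannot be read.
# $1 = hwmon directory, $2 = label from the EM config (e.g. vout1)
find_hwmon_input_by_label() {
    hwmon_dir=$1
    want=$2
    for label_file in "$hwmon_dir"/*_label; do
        [ -f "$label_file" ] || continue
        label=$(cat "$label_file")
        [ "$label" = "$want" ] || continue
        # in2_label -> in2_input, temp1_label -> temp1_input, ...
        input_file=${label_file%_label}_input
        if reading=$(cat "$input_file" 2>/dev/null); then
            echo "$input_file $reading"
        else
            echo "$input_file READ_FAILED"
        fi
    done
}
```

For example:

root@HOSTNAME:~# find_hwmon_input_by_label /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*} vout1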

Enable allow_sensor_creation_failure in tlBMC Configuration

The method described above can only diagnose one sensor at a time. tlBMC provides an option that bypasses sensor creation failures while still reporting useful debugging information: the allow_sensor_creation_failure setting in the tlBMC central configuration.

With this setting, the tlBMC store is created even when sensor creation fails. Valid sensors are still served by tlBMC, and debug information is available from tlBMC debug paths such as:

root@HOSTNAME:~# curl localhost/redfish/tlbmc/AllSensors
{
...
"error": {
    "@Message.ExtendedInfo": [
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "Sensor temperature_{SENSOR_NAME} is not ready in tlBMC Store: Failed to read from input device: No such device or address; input device path: /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}/hwmon/hwmon{*}/temp{*}_input",
        "MessageId": "Base.1.13.0.InternalError"
      },
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "Sensor voltage_{SENSOR_NAME} is not ready in tlBMC Store: Read data can't be converted to a number: Invalid argument",
        "MessageId": "Base.1.13.0.InternalError"
      },
      ...
    ],
    "code": "Base.1.8.GeneralError",
    "message": "A general error has occurred. See Resolution for information on how to resolve the error."
  }
}

All sensor creation errors encountered during tlBMC store creation are combined in the AllSensors response following the Redfish error message spec.
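To get a compact list of the failing sensors from that response, the messages can be filtered by their common wording. This is a sketch that assumes every failure message contains the phrase "is not ready in tlBMC Store", as in the sample above; list_failed_sensors is a hypothetical helper name.

```shell
# Hypothetical helper: reads an AllSensors response on stdin and prints
# one line per failure message, from "Sensor ..." up to the closing
# quote of the JSON string.
list_failed_sensors() {
    grep -o 'Sensor [^"]* is not ready in tlBMC Store: [^"]*'
}
```

For example:

root@HOSTNAME:~# curl -s localhost/redfish/tlbmc/AllSensors | list_failed_sensors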

To enable the allow_sensor_creation_failure feature, make a change similar to https://gbmc-private-review.git.corp.google.com/c/meta-google-private/+/35823. Add or modify the entry for the desired platform to include:

sensor_collector_module {
  enabled: true
  allow_sensor_creation_failure: true
}

Build and flash a bmcweb binary that includes this change for the central config to take effect.

Verify Real Hardware Failures

For additional information when diagnosing sensor failures, check the kernel logs with dmesg. Run the following command and look for logs similar to:

root@HOSTNAME:~# dmesg | grep "Failed to register"
[   92.520183] i2c i2c-{BUS}: Failed to register i2c client {DRIVER} at 0x{ADDRESS} (-16)

This indicates that the kernel failed to set up the device. The trailing (-16) is the kernel error code -EBUSY (device or resource busy), so the cause may be a real hardware failure, or the device may have been occupied by another script or service during boot.

Potential short-term solutions in this case are:

  • Manually bind the device using:

    root@HOSTNAME:~# echo "{BUS}-00{ADDRESS}" > /sys/bus/i2c/drivers/{DRIVER}/bind

    If the device was temporarily occupied during boot, this may correctly set up the device.

  • Power cycle the machine. Rebooting has fixed sensor instantiation in some cases.

If either approach above is used, a bug should still be filed and the issue should be reproduced. Flaky sensor creation could mask underlying problems (e.g. b/428930642).
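After a manual bind or a power cycle, it can help to confirm that a driver is actually attached to the device. The sketch below relies on the driver symlink that appears in the device directory listing shown earlier; bound_driver is a hypothetical helper name.

```shell
# Hypothetical helper: prints the name of the driver bound to an i2c
# device directory, or "unbound" when the directory has no driver symlink.
# $1 = device directory, e.g. /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}
bound_driver() {
    if [ -L "$1/driver" ]; then
        basename "$(readlink "$1/driver")"
    else
        echo unbound
    fi
}
```

For example:

root@HOSTNAME:~# bound_driver /sys/bus/i2c/devices/i2c-{BUS}/{BUS}-00{ADDRESS}

If this prints the expected {DRIVER}, the hwmon directory should also be present in the device directory.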

Note: Some "Failed to register i2c client" logs may be expected, for instance when the EM config includes sensors for second-source boards. Those sensors are expected to fail to create if the FRU on the machine is not the second-source FRU. Also note that such expected dmesg errors are only possible when these sensors are not supported by tlBMC. If tlBMC were to support the second-source board sensors, a separate config would be needed to logically separate these sensors and probe accordingly.