The Voyager Streaming Telemetry Server

Introduction

This is a standalone gRPC server that runs on every arena management node (Baseboard Management Controller, or BMC) of a data center machine. Its expected clients are the cloud telemetry collectors shown in this diagram, or any other client on the network with valid credentials:

System Diagram

The collector depends only on the Voyager Telemetry gRPC proto definition to interact with this server (the server's northbound API). The server polls telemetry sources when clients subscribe to them.

It defines a generic “TelemetrySource” API for adding new types of sources to its management (the server's southbound API). There is no explicit dependency on existing OpenBMC services on the arena management node. For example, you can implement I2C sensors as a type of source by reading I2C sysfs files directly, and/or use MCTP sockets for sources from host CPUs.
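As a rough illustration of what a southbound source could look like, here is a minimal sketch of a source that reads a sysfs-style text file directly. The trait shape, the `Sample` struct, and the `SysfsSource` type are hypothetical names chosen for this example; the actual TelemetrySource API in this repository may differ.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::time::SystemTime;

/// One telemetry sample: the value plus the time it was read.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub value: f64,
    pub timestamp: SystemTime,
}

/// Hypothetical southbound trait (an assumption for this sketch, not the
/// repository's real API).
pub trait TelemetrySource {
    /// Identifier of the source, for example a Redfish @odata.id.
    fn id(&self) -> &str;
    /// Read one sample directly from the underlying hardware interface.
    fn sample(&mut self) -> io::Result<Sample>;
}

/// A source backed by a sysfs-style text file holding a single number.
pub struct SysfsSource {
    id: String,
    path: PathBuf,
}

impl SysfsSource {
    pub fn new(id: impl Into<String>, path: impl Into<PathBuf>) -> Self {
        Self { id: id.into(), path: path.into() }
    }
}

impl TelemetrySource for SysfsSource {
    fn id(&self) -> &str {
        &self.id
    }

    fn sample(&mut self) -> io::Result<Sample> {
        let text = fs::read_to_string(&self.path)?;
        let value = text
            .trim()
            .parse::<f64>()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        // Timestamp is taken at read time, right next to the hardware access.
        Ok(Sample { value, timestamp: SystemTime::now() })
    }
}

fn main() -> io::Result<()> {
    // A temporary file stands in for a real sysfs attribute so the sketch runs anywhere.
    let path = std::env::temp_dir().join("fake_fan_input");
    fs::write(&path, "11565\n")?;
    let mut src = SysfsSource::new("fantach_fan5_tach", &path);
    let s = src.sample()?;
    println!("{} = {}", src.id(), s.value);
    fs::remove_file(&path)
}
```

Because the source reads the file itself, no D-Bus service sits between the hardware and the sample.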

Why not use Redfish SSE for telemetry

OpenBMC uses Server-Sent Events (SSE) for server-pushed events, which can be used for machine telemetry. However, in our real-world use cases, especially with the rise of ML hardware usage in data centers, we found that this interface doesn't optimally satisfy our requirements. Here's why:

1. Millisecond-level telemetry sampling performance

The OpenBMC solution relies on existing telemetry source services, such as the PSUSensor systemd service, to poll the hardware and provide telemetry data via D-Bus object property change events. This approach has several limitations:

  • System-wide telemetry throughput is constrained by D-Bus IPC bandwidth, which is significantly lower than what real hardware sampling can achieve. Hardware sources like I2C sensors can typically be polled at 0.5 ms to 5 ms intervals. Even at the slow end of that range (5 ms), a system with 100 I2C sensors produces 20,000 telemetry samples per second, while D-Bus system-wide bandwidth is about one order of magnitude lower.
  • Modeling each sensor as a D-Bus object and exposing its value changes through D-Bus property changes loses the original sampling timestamp, which reduces the fidelity of the data in high-resolution use cases.
  • The current approach lacks flexibility in polling rates. Ideally, we should be able to poll some sources at high resolution while keeping the majority at low resolution, considering the limited computing resources of a typical BMC card.
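The throughput figure in the first bullet can be checked with a short calculation. This is only a back-of-the-envelope sketch; the function name is made up for this example, and the sensor count and intervals are the ones quoted in the text.

```rust
/// Samples per second produced by `sensors` sources, each polled every `interval_ms`.
fn samples_per_second(sensors: u32, interval_ms: f64) -> f64 {
    sensors as f64 * (1000.0 / interval_ms)
}

fn main() {
    // 100 sensors across the 0.5 ms to 5 ms polling range mentioned in the text.
    for interval_ms in [5.0, 1.0, 0.5] {
        println!(
            "100 sensors @ {} ms -> {} samples/s",
            interval_ms,
            samples_per_second(100, interval_ms)
        );
    }
}
```

At 5 ms this gives 20,000 samples per second, and at 0.5 ms it gives 200,000, so the gap to D-Bus bandwidth widens further at the fast end of the range.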

Our new telemetry server addresses these issues by providing dynamically adjustable, millisecond-level telemetry sampling, instead of the fixed polling rate at the second level used in OpenBMC.

2. A unified, direct gRPC telemetry interface without HTTP/Redfish indirection

gRPC naturally provides a streaming interface for telemetry. By removing the assumption that telemetry must go through the Redfish interface, we can eliminate layers of indirection and improve performance:

  • Remove the HTTP request/response layer between cloud collectors and BMC.
  • Eliminate D-Bus IPC (including message marshaling/unmarshaling) between BMC telemetry source services and the BMC telemetry server.

We still use the Redfish specification as our data model, ensuring compatibility with existing telemetry data consumers.

3. Simplified telemetry collector logic

The current Redfish-based BMC telemetry solution requires multiple transactions between client and server to select the desired telemetry sources. It also needs additional rules and optimization efforts for optimal telemetry collection.

Our new telemetry server simplifies this process: a collector client can use XPath-like syntax or a simple server configuration name to subscribe to all telemetry sources of interest, with the desired parameters, in a single gRPC call.

4. Threshold-based dynamic sampling rate

For a more meaningful telemetry solution, we want data collection to focus on sources that need more attention. We achieve this through threshold-controlled telemetry sampling rates:

  • The sampling rate of each telemetry source can vary based on its current value.
  • This can be configured either through a static server configuration or via a new threshold configuration passed in the client subscription request.

This approach ensures that critical data is collected more frequently when needed, while conserving resources during normal operation.
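To make the idea concrete, here is a minimal sketch of how a sampling interval could be chosen from the current reading. The `Band` struct, the `interval_for` function, and all numeric values are illustrative assumptions; the server's real threshold configuration format is not shown here.

```rust
use std::time::Duration;

/// Hypothetical threshold band: while the reading is at or above `lower_bound`
/// (and below the next band's bound), sample at `interval`.
#[derive(Debug, Clone, Copy)]
struct Band {
    lower_bound: f64,
    interval: Duration,
}

/// Pick the sampling interval for `value`.
/// `bands` must be sorted by ascending `lower_bound`; `default` applies
/// when the value is below every band.
fn interval_for(value: f64, bands: &[Band], default: Duration) -> Duration {
    bands
        .iter()
        .rev()
        .find(|b| value >= b.lower_bound)
        .map(|b| b.interval)
        .unwrap_or(default)
}

fn main() {
    // Illustrative numbers only: a fan tach polled faster as its RPM rises.
    let bands = [
        Band { lower_bound: 8_000.0, interval: Duration::from_millis(100) },
        Band { lower_bound: 11_000.0, interval: Duration::from_millis(5) },
    ];
    let default = Duration::from_secs(1);
    for rpm in [3_000.0, 9_500.0, 11_565.0] {
        println!("{} RPM -> poll every {:?}", rpm, interval_for(rpm, &bands, default));
    }
}
```

A poller would re-evaluate the interval after each sample, so a source moves between slow and fast polling as its value crosses a bound.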

Components Diagram

The core concept of this telemetry server is the Telemetry Source Manager. For more details, please see the Telemetry Source Manager README.

Components Diagram

Subscription Sequence Diagram

A subscribe call from a collector client creates a bi-directional stream:

Subscription Sequence Diagram

A client can now subscribe to a set of telemetry sources using one of the following requests:

  1. Server configuration name:

    TelemetryRequest {
        req_id: "req_repairability".into(),
        req_config_group: "repairability_basic_cfg_group".into(),
        ..Default::default()
    }
    
  2. XPath-like query:

    Fqp {
        specifier: "/redfish/v1/Chassis/{ChassisId}/Sensors/{SensorId}".into(),
        identifiers: HashMap::from([
            ("ChassisId".into(), "*".into()),
            ("SensorId".into(), "*".into()),
        ]),
        r#type: FqpType::NotSet as i32,
        ..Default::default()
    }
    
  3. Redfish data type:

    Fqp {
        specifier: "#Sensor.v1_2_0.Sensor".into(),
        r#type: FqpType::RedfishResource as i32,
        ..Default::default()
    }
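
For intuition about the XPath-like query in option 2, the sketch below shows one way a specifier with `{Name}` placeholders could be matched against a concrete Redfish path. The function name `fqp_matches` and the matching rules (a "*" or missing identifier matches any single path segment) are assumptions made for this example, not the server's actual resolution logic.

```rust
use std::collections::HashMap;

/// Returns true if `path` matches `specifier`. Each `{Name}` segment in the
/// specifier is looked up in `identifiers`; "*" or a missing entry matches
/// any single non-empty segment, any other value must match exactly.
fn fqp_matches(specifier: &str, identifiers: &HashMap<String, String>, path: &str) -> bool {
    let spec: Vec<&str> = specifier.split('/').collect();
    let segs: Vec<&str> = path.split('/').collect();
    if spec.len() != segs.len() {
        return false;
    }
    spec.iter().zip(&segs).all(|(s, p)| {
        match s.strip_prefix('{').and_then(|rest| rest.strip_suffix('}')) {
            // Placeholder segment: consult the identifiers map.
            Some(name) => match identifiers.get(name).map(String::as_str) {
                None | Some("*") => !p.is_empty(),
                Some(want) => want == *p,
            },
            // Literal segment: must match exactly.
            None => s == p,
        }
    })
}

fn main() {
    let ids = HashMap::from([
        ("ChassisId".to_string(), "*".to_string()),
        ("SensorId".to_string(), "*".to_string()),
    ]);
    let spec = "/redfish/v1/Chassis/{ChassisId}/Sensors/{SensorId}";
    let path = "/redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach";
    println!("match: {}", fqp_matches(spec, &ids, path));
}
```

Replacing the "*" for ChassisId with a concrete name such as "ChassisOne" would narrow the same specifier to the sensors of a single chassis.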
    

Each response message in the stream is an Update message that contains a vector of Datapoints, each representing a telemetry sample from a source:

0th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.411
1th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.417
...
45th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.687
46th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.695
47th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.704
48th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.710
...
79th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.903
80th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.909

This time series captures telemetry source value changes with millisecond precision.
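To see the sampling interval in such output, a collector could diff consecutive timestamps. The sketch below assumes the `YYYY:MM:DD:HH:MM:SS.mmm` layout seen in the sample output above and ignores the date part, so it only compares samples from the same day; the function name is made up for this example.

```rust
/// Parse "YYYY:MM:DD:HH:MM:SS.mmm" and return milliseconds since midnight.
/// The date fields are ignored, so only same-day samples are comparable.
fn millis_of_day(ts: &str) -> Option<u64> {
    let parts: Vec<&str> = ts.split(':').collect();
    if parts.len() != 6 {
        return None;
    }
    let hours: u64 = parts[3].parse().ok()?;
    let minutes: u64 = parts[4].parse().ok()?;
    let (sec, frac) = parts[5].split_once('.')?;
    let seconds: u64 = sec.parse().ok()?;
    // Require exactly three fractional digits (milliseconds).
    if frac.len() != 3 {
        return None;
    }
    let millis: u64 = frac.parse().ok()?;
    Some(((hours * 60 + minutes) * 60 + seconds) * 1000 + millis)
}

fn main() {
    // Timestamps of the 45th to 48th datapoints from the sample output.
    let stamps = [
        "2024:10:09:02:20:56.687",
        "2024:10:09:02:20:56.695",
        "2024:10:09:02:20:56.704",
        "2024:10:09:02:20:56.710",
    ];
    let ms: Vec<u64> = stamps.iter().filter_map(|s| millis_of_day(s)).collect();
    for pair in ms.windows(2) {
        println!("interval: {} ms", pair[1] - pair[0]);
    }
}
```

For these four samples the gaps are 8 ms, 9 ms and 6 ms, which shows the per-sample timestamps resolving individual polls rather than one timestamp per batch.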

Build for an ASpeed 2600 BMC board

Download the code to your Linux workstation

~/workspace$ git clone https://github/google/streaming-telemetry-server -b main; cd streaming-telemetry-server/streaming_telemetry

Install the Rust cross-compilation tool (cross) if you have not already done so. It requires Docker to be installed first.

~$ cargo install -f cross

Build the target for ASpeed 2600 SoC

~/workspace/streaming-telemetry-server/streaming_telemetry$ cross build --no-default-features --release --target armv7-unknown-linux-gnueabihf
~/workspace/streaming-telemetry-server/streaming_telemetry$ file ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server
../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=8b7ef3cef9da4c72110cb1b12c9bad135c2b2c60, with debug_info, not stripped

Run on an ASpeed 2600 BMC board

  1. Copy the built binary and the server configuration file to the target BMC board

    ~/workspace/streaming-telemetry-server/streaming_telemetry$ sshpass -p 0penBmc scp ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server ../yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto root@bmc:/tmp/
    
  2. Run the gRPC telemetry server

    root@bmc:~# /tmp/streaming_telemetry_server --port 50051 --insecure --config /tmp/streaming_telemetry_server_config.textproto --emconfig /usr/share/entity-manager/configurations/BMCBoard.json,/usr/share/entity-manager/configurations/MotherBoard.json
    Running in insecure mode on port 50051...
    
  3. Set up an SSH tunnel from the Linux workstation to the target BMC

    user1@workstation:~$ sshpass -p 0penBmc ssh -L 50051:localhost:50051 root@bmc
    
  4. Build and run the test gRPC client from the Linux workstation

    user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ cargo build --release --features=build-client
    user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client --port 50051 --insecure
    

Test telemetry with threshold

Assume we have a threshold configuration entry defined for a sensor “fantach_fan4_tach” in the server_config.

Run the same command as above, but only monitor the fan tach sensor “fantach_fan4_tach”:

user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client --port 50051 --insecure  | grep fantach_fan4_tach

Then adjust the sensor value manually from the BMC console. The sensor polling rate will vary with the new sensor value, depending on which range of its threshold configuration the value falls into:

# Ensure the Fan Zone modes are all "Manual" or "Disabled"
root@bmc:~# curl localhost/redfish/v1/Managers/bmc#/Oem/OpenBmc/Fan/FanZones/Zone_0 | grep FanMode
# If not, disable PID control service:
root@bmc:~# systemctl stop phosphor-pid-control

# Get baseline values
root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach
root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fanpwm_fan4_pwm

# Change fanpwm_fan4_pwm
root@bmc:~# curl -X PATCH http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fanpwm_fan4_pwm -d '{"Reading": 30.0}'

# Read back fan tach to confirm
root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach

Run with mTLS scheme

  1. Build from Yocto:
    mTLS-related dependencies can currently only be built from Yocto.
    To build from Yocto, use the bitbake recipes under the yocto folder.

    user1@workstation:/var/bmc/build/my_machine$ bitbake -c compile streaming-telemetry-server
    user1@workstation:/var/bmc/build/my_machine$ sshpass -p 0penBmc scp ../../meta-my_machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto root@bmc:/tmp/
    user1@workstation:/var/bmc/build/my_machine$ sshpass -p 0penBmc scp tmp/work/armv7ahf-vfpv3d16-openbmc-linux-gnueabi/streaming-telemetry-server/0.1.0/build/target/armv7-openbmc-linux-gnueabihf/release/telemetry_server  root@bmc:/tmp/
    
  2. Prepare test keys and test mTLS policy:

  3. Run the gRPC telemetry server on the BMC:

    root@bmc:~# LD_LIBRARY_PATH=/tmp /tmp/streaming_telemetry_server \
        --cert /tmp/test-server-cert.pem \
        --key /tmp/test-server-key.pem \
        --cacert /tmp/test-cacert.pem \
        --crls /tmp/crls \
        --policy /tmp/test-mtls.policy \
        --config /tmp/streaming_telemetry_server_config.textproto \
        --emconfig /usr/share/entity-manager/configurations/BMCBoard.json,/usr/share/entity-manager/configurations/MotherBoard.json \
        --port 50051
    
  4. Run the gRPC client from the Linux workstation:

    user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client \
        --cert test-client-cert.pem \
        --key test-client-key.pem \
        --cacert test-cacert.pem \
        --server_dns target_bmc_dns_name \
        --port 50051
    

Release note

Version 0.1.0, Sep 19 2024

  1. Features
  • Support subscribing by XPath with wildcard syntax.
  • Support subscribing by server configuration name.
  2. Fixes