| # The Voyager Streaming Telemetry Server |
| |
| ## Table of Contents |
| |
| - [Introduction](#introduction) |
| - [Why not use Redfish SSE for telemetry](#why-not-use-redfish-sse-for-telemetry) |
| - [1. Millisecond-level telemetry sampling performance](#1-millisecond-level-telemetry-sampling-performance) |
| - [2. A unified, direct gRPC telemetry interface without HTTP/Redfish indirection](#2-a-unified-direct-grpc-telemetry-interface-without-httpredfish-indirection) |
| - [3. Simplified telemetry collector logic](#3-simplified-telemetry-collector-logic) |
| - [4. Threshold-based dynamic sampling rate](#4-threshold-based-dynamic-sampling-rate) |
| - [Components Diagram](#components-diagram) |
| - [Subscription Sequence Diagram](#subscription-sequence-diagram) |
| - [Build for an ASpeed 2600 BMC board](#build-for-an-aspeed-2600-bmc-board) |
| - [Run on an ASpeed 2600 BMC board](#run-on-an-aspeed-2600-bmc-board) |
| |
| ## Introduction |
| |
| This is a standalone gRPC server that runs on every arena management node |
| (Baseboard Management Controller, or BMC) of a data center machine. Its |
| expected clients are the cloud telemetry collectors shown in this diagram, or |
| any other network client with valid credentials: |
| |
|  |
| |
| The collector depends only on the |
| [Voyager Telemetry gRPC proto](streaming_telemetry/proto/voyager_telemetry.proto) |
| definition to interact with this server (the server's northbound API). The |
| server polls telemetry sources once clients subscribe to them. |
| |
| It defines a generic "TelemetrySource" |
| [API](streaming_telemetry/src/telemetry_source_manager/README.md#southbound-api) |
| for adding new types of sources under its management (the server's southbound |
| API). There is no explicit dependency on existing |
| [OpenBMC](https://www.openbmc.org/) services on the arena management node. For |
| example, you can implement I2C sensors as a type of source by reading the I2C |
| sysfs files directly, or use |
| [MCTP sockets](https://github.com/openbmc/docs/blob/master/designs/mctp/mctp-kernel.md) |
| for sources from host CPUs. |
| |
| ## Why not use Redfish SSE for telemetry |
| |
| OpenBMC uses |
| [Server-Sent Events (SSE)](https://github.com/openbmc/docs/blob/master/designs/redfish-eventservice.md) |
| for server-pushed events, which can be used for machine telemetry. However, in |
| our real-world use cases, especially with the rise of ML hardware usage in data |
| centers, we found that this interface doesn't optimally satisfy our |
| requirements. Here's why: |
| |
| ### 1. Millisecond-level telemetry sampling performance |
| |
| The OpenBMC solution relies on existing telemetry source services, such as the |
| [PSUSensor systemd service](https://github.com/openbmc/dbus-sensors/blob/master/src/PSUSensor.hpp#L62), |
| to poll the hardware and provide telemetry data via D-Bus object property change |
| events. This approach has several limitations: |
| |
| - System-wide telemetry throughput is constrained by D-Bus IPC bandwidth, which |
| is significantly lower than what real hardware sampling can achieve. Hardware |
| sources like I2C sensors can typically be polled at 0.5 ms to 5 ms intervals. |
| For a system with 100 I2C sensors each polled every 5 ms, that is 20,000 |
| telemetry samples per second, while D-Bus system-wide bandwidth is about one |
| order of magnitude lower. |
| - Modeling each sensor as a D-Bus object and exposing its value changes through |
| D-Bus property changes loses the original sampling timestamp, which reduces |
| the fidelity of the data in high-resolution use cases. |
| - The current approach lacks flexibility in polling rates. Ideally, we should be |
| able to poll some sources at high resolution while keeping the majority at low |
| resolution, considering the limited computing resources of a typical BMC card. |
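
The throughput figure in the first bullet is plain arithmetic, and a small helper makes it easy to redo for other sensor counts or polling intervals. This is only a sketch: the function name `aggregate_samples_per_sec` is made up for illustration and is not part of the server, and it assumes every sensor is polled at the same fixed interval.

```rust
/// Aggregate samples per second produced by `sensor_count` sources when each
/// one is polled once every `interval_us` microseconds.
/// Hypothetical helper for illustration; not part of the telemetry server.
fn aggregate_samples_per_sec(sensor_count: u64, interval_us: u64) -> u64 {
    assert!(interval_us > 0, "polling interval must be non-zero");
    sensor_count * 1_000_000 / interval_us
}
```

With 100 sensors, a 5 ms interval (`5_000` microseconds) yields the 20,000 samples per second quoted above, and a 0.5 ms interval would yield ten times that.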
| |
| Our new telemetry server addresses these issues by providing dynamically |
| adjustable, millisecond-level telemetry sampling, instead of the |
| [fixed polling rate at the second level](https://github.com/openbmc/dbus-sensors/blob/master/src/PSUSensorMain.cpp#L534C51-L534C79) |
| used in OpenBMC. |
| |
| ### 2. A unified, direct gRPC telemetry interface without HTTP/Redfish indirection |
| |
| gRPC naturally provides a streaming interface for telemetry. By removing the |
| assumption that telemetry must go through the Redfish interface, we can |
| eliminate layers of indirection and improve performance: |
| |
| - Remove the HTTP request/response layer between cloud collectors and BMC. |
| - Eliminate D-Bus IPC (including message marshaling/unmarshaling) between BMC |
| telemetry source services and the BMC telemetry server. |
| |
| We still use the Redfish specification as our data model, ensuring compatibility |
| with existing telemetry data consumers. |
| |
| ### 3. Simplified telemetry collector logic |
| |
| The current Redfish-based BMC telemetry solution requires multiple transactions |
| between client and server to select desired telemetry sources. It also needs |
| additional rules and optimization efforts for optimal telemetry collection. Our |
| new telemetry server simplifies this process by: |
| |
| - Providing an xpath-like |
| [resource addressing syntax](streaming_telemetry/proto/voyager_telemetry.proto#100) |
| for efficient source selection. |
| - Offering optimized, |
| [statically configured collection](streaming_telemetry/proto/voyager_server_config.proto#39) |
| as the default path. |
| |
| This allows a collector client to use xpath-like syntax or a simple |
| [server configuration name](yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto#3) to subscribe to all |
| telemetry sources of interest, with the desired parameters, in a single gRPC call. |
| |
| ### 4. Threshold-based dynamic sampling rate |
| |
| For a more meaningful telemetry solution, we want data collection to focus on |
| sources that need more attention. We achieve this through |
| [threshold-controlled](streaming_telemetry/proto/voyager_telemetry.proto#197) |
| telemetry sampling rates: |
| |
| - The sampling rate of each telemetry source can vary based on its current |
| value. |
| - This can be configured either through a |
| [static server configuration](streaming_telemetry/proto/voyager_server_config.proto) |
| or via a new threshold configuration passed in the client subscription |
| request. |
| |
| This approach ensures that critical data is collected more frequently when |
| needed, while conserving resources during normal operation. |
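
The idea can be sketched as a lookup from the latest reading to a sampling interval. The code below is an illustration only: `ThresholdBand`, `sampling_interval`, and the RPM bands are hypothetical and do not mirror the actual messages in `voyager_telemetry.proto` or `voyager_server_config.proto`.

```rust
use std::time::Duration;

/// One threshold band: readings in `[lower, upper)` are sampled every `interval`.
/// Hypothetical type for illustration; the real configuration lives in the proto files.
#[derive(Debug, Clone, PartialEq)]
struct ThresholdBand {
    lower: f64,
    upper: f64,
    interval: Duration,
}

/// Pick the sampling interval for `value`, falling back to `default_interval`
/// when no band contains it.
fn sampling_interval(
    bands: &[ThresholdBand],
    value: f64,
    default_interval: Duration,
) -> Duration {
    bands
        .iter()
        .find(|b| value >= b.lower && value < b.upper)
        .map(|b| b.interval)
        .unwrap_or(default_interval)
}

/// Example bands for a fan tachometer. The RPM limits are invented: sample
/// fast when the fan is unusually slow or fast, slowly in the normal range.
fn example_fan_bands() -> Vec<ThresholdBand> {
    vec![
        ThresholdBand { lower: 0.0, upper: 3_000.0, interval: Duration::from_millis(5) },
        ThresholdBand { lower: 3_000.0, upper: 12_000.0, interval: Duration::from_millis(1_000) },
        ThresholdBand { lower: 12_000.0, upper: f64::INFINITY, interval: Duration::from_millis(5) },
    ]
}
```

A poller built this way would re-evaluate `sampling_interval` after each reading, so the rate changes as soon as a value crosses into another band.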
| |
| ## Components Diagram |
| |
| The core concept of this telemetry server is the Telemetry Source Manager. For |
| more details, please see the |
| [Telemetry Source Manager README](streaming_telemetry/src/telemetry_source_manager/README.md). |
| |
|  |
| |
| ## Subscription Sequence Diagram |
| |
| A subscribe call from a collector client creates a bi-directional stream: |
| |
|  |
| |
| A client can subscribe to a set of telemetry sources using one of the |
| following [Request](streaming_telemetry/proto/voyager_telemetry.proto#313) forms: |
| |
| 1. **Server configuration name:** |
| |
| ```rust |
| TelemetryRequest { |
|     req_id: "req_repairability".into(), |
|     req_config_group: "repairability_basic_cfg_group".into(), |
|     ..Default::default() |
| } |
| ``` |
| |
| 2. **XPath-like query:** |
| |
| ```rust |
| Fqp { |
|     specifier: "/redfish/v1/Chassis/{ChassisId}/Sensors/{SensorId}".into(), |
|     identifiers: HashMap::from([ |
|         ("ChassisId".into(), "*".into()), |
|         ("SensorId".into(), "*".into()), |
|     ]), |
|     r#type: FqpType::NotSet as i32, |
|     ..Default::default() |
| } |
| ``` |
| |
| 3. **Redfish data type:** |
| |
| ```rust |
| Fqp { |
|     specifier: "#Sensor.v1_2_0.Sensor".into(), |
|     r#type: FqpType::RedfishResource as i32, |
|     ..Default::default() |
| } |
| ``` |
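
To make the XPath-like query in item 2 concrete, the sketch below shows one plausible way a `{Name}` specifier and its identifier map could be expanded and matched against Redfish resource paths. It is an assumption about the semantics, not the server's actual matcher: `expand_specifier` and `path_matches` are hypothetical names, and treating a missing identifier as `*` is a choice made here for illustration.

```rust
use std::collections::HashMap;

/// Replace each `{Name}` path segment in `specifier` with its value from
/// `identifiers`. Placeholders without an entry become the wildcard "*"
/// (an assumption made for this sketch).
fn expand_specifier(specifier: &str, identifiers: &HashMap<String, String>) -> String {
    specifier
        .split('/')
        .map(|seg| match seg.strip_prefix('{').and_then(|s| s.strip_suffix('}')) {
            Some(name) => identifiers.get(name).map(String::as_str).unwrap_or("*"),
            None => seg,
        })
        .collect::<Vec<_>>()
        .join("/")
}

/// Segment-wise match: a pattern segment of "*" matches any single path
/// segment, and every other segment must be equal.
fn path_matches(pattern: &str, path: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let segs: Vec<&str> = path.split('/').collect();
    pat.len() == segs.len() && pat.iter().zip(&segs).all(|(p, s)| *p == "*" || p == s)
}
```

Under these assumptions, the specifier from item 2 expands to `/redfish/v1/Chassis/*/Sensors/*`, which matches a path such as `/redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach`.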
| |
| Each response message in the stream is an |
| [Update](streaming_telemetry/proto/voyager_telemetry.proto#156) message that |
| contains a vector of |
| [Datapoint](streaming_telemetry/proto/voyager_telemetry.proto#137)s, each |
| representing a telemetry sample from a source: |
| |
| ```sh |
| 0th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.411 |
| 1th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.417 |
| ... |
| 45th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.687 |
| 46th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.695 |
| 47th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.704 |
| 48th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.710 |
| ... |
| 79th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.903 |
| 80th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.909 |
| ``` |
| |
| This time series captures telemetry source value changes with millisecond |
| precision. |
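
The spacing between consecutive samples can be checked from the timestamps in output like the above. The sketch below assumes the timestamp layout is `year:month:day:hour:minute:second.millis` with a three-digit millisecond field, and that all samples fall on the same day. The function names are hypothetical and not part of the test client.

```rust
/// Parse the time-of-day part of a timestamp such as "2024:10:09:02:20:56.411"
/// into milliseconds since midnight. Returns `None` on malformed input.
/// Assumes the fractional part is exactly milliseconds (three digits).
fn time_of_day_ms(ts: &str) -> Option<u64> {
    let fields: Vec<&str> = ts.split(':').collect();
    if fields.len() != 6 {
        return None;
    }
    let hours: u64 = fields[3].parse().ok()?;
    let minutes: u64 = fields[4].parse().ok()?;
    let (secs, millis) = fields[5].split_once('.')?;
    let secs: u64 = secs.parse().ok()?;
    let millis: u64 = millis.parse().ok()?;
    Some(((hours * 60 + minutes) * 60 + secs) * 1000 + millis)
}

/// Gaps in milliseconds between consecutive timestamps taken on the same day.
fn intervals_ms(timestamps: &[&str]) -> Option<Vec<u64>> {
    let ms: Vec<u64> = timestamps
        .iter()
        .map(|t| time_of_day_ms(t))
        .collect::<Option<_>>()?;
    Some(ms.windows(2).map(|w| w[1].saturating_sub(w[0])).collect())
}
```

Applied to the 45th through 48th datapoints above, this gives gaps of 8, 9, and 6 ms.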
| |
| ## Build for an ASpeed 2600 BMC board |
| |
| Download the code to your Linux workstation: |
| |
| ```sh |
| ~/workspace$ git clone https://github.com/google/streaming-telemetry-server -b main; cd streaming-telemetry-server/streaming_telemetry |
| ``` |
| |
| Install the Rust [cross](https://kerkour.com/rust-cross-compilation) build tool |
| if you have not done so yet. It requires Docker to be installed first. |
| |
| ```sh |
| ~$ cargo install -f cross |
| ``` |
| |
| Build the target for the ASpeed 2600 SoC: |
| |
| ```sh |
| ~/workspace/streaming-telemetry-server/streaming_telemetry$ cross build --no-default-features --release --target armv7-unknown-linux-gnueabihf |
| ~/workspace/streaming-telemetry-server/streaming_telemetry$ file ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server |
| ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=8b7ef3cef9da4c72110cb1b12c9bad135c2b2c60, with debug_info, not stripped |
| ``` |
| |
| ## Run on an ASpeed 2600 BMC board |
| |
| 1. Copy the server binary and its configuration to the target BMC board<br> |
| |
| ```sh |
| ~/workspace/streaming-telemetry-server/streaming_telemetry$ sshpass -p 0penBmc scp ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server ../yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto root@bmc:/tmp/ |
| ``` |
| |
| 2. Run the gRPC telemetry server<br> |
| |
| ```sh |
| root@bmc:~# /tmp/streaming_telemetry_server --port 50051 --insecure --config /tmp/streaming_telemetry_server_config.textproto --emconfig /usr/share/entity-manager/configurations/BMCBoard.json,/usr/share/entity-manager/configurations/MotherBoard.json |
| Running in insecure mode on port 50051... |
| ``` |
| |
| 3. Set up an SSH tunnel from the Linux workstation to the target BMC<br> |
| |
| ```sh |
| user1@workstation:~$ sshpass -p 0penBmc ssh -L 50051:localhost:50051 root@bmc |
| ``` |
| |
| 4. Build and run the test gRPC client from the Linux workstation<br> |
| |
| ```sh |
| user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ cargo build --release --features=build-client |
| user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client --port 50051 --insecure |
| ``` |
| |
| ### Test telemetry with threshold |
| |
| Assume a threshold configuration entry is defined for the sensor |
| ["fantach_fan4_tach"](yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/tmp/streaming_telemetry_server_config.textproto#133) in the server configuration. |
| <br> |
| |
| Run the same client command as above, but monitor only the fan tach sensor |
| "fantach_fan4_tach":<br> |
| |
| ```sh |
| user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client --port 50051 --insecure | grep fantach_fan4_tach |
| ``` |
| |
| Then adjust the sensor value manually from the BMC console. The sensor polling |
| rate will change with the new sensor value, depending on which range of its |
| threshold configuration the value falls into:<br> |
| |
| ```sh |
| # Ensure the Fan Zone modes are all "Manual" or "Disabled" |
| root@bmc:~# curl localhost/redfish/v1/Managers/bmc#/Oem/OpenBmc/Fan/FanZones/Zone_0 | grep FanMode |
| # If not, disable PID control service: |
| root@bmc:~# systemctl stop phosphor-pid-control |
| |
| # Get baseline values |
| root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach |
| root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fanpwm_fan4_pwm |
| |
| # Change fanpwm_fan4_pwm |
| root@bmc:~# curl -X PATCH http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fanpwm_fan4_pwm -d '{"Reading": 30.0}' |
| |
| # Read back fan tach to confirm |
| root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach |
| ``` |
| |
| ### Run with mTLS scheme |
| |
| 1. Build from Yocto:<br> mTLS-related dependencies can currently be built only |
| from Yocto.<br> To build from Yocto, use the BitBake recipes under the yocto |
| folder.<br> |
| |
| ```sh |
| user1@workstation:/var/bmc/build/my_machine$ bitbake -c compile streaming-telemetry-server |
| user1@workstation:/var/bmc/build/my_machine$ sshpass -p 0penBmc scp ../../meta-my_machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto root@bmc:/tmp/ |
| user1@workstation:/var/bmc/build/my_machine$ sshpass -p 0penBmc scp tmp/work/armv7ahf-vfpv3d16-openbmc-linux-gnueabi/streaming-telemetry-server/0.1.0/build/target/armv7-openbmc-linux-gnueabihf/release/telemetry_server root@bmc:/tmp/ |
| ``` |
| |
| 2. Prepare test keys and test mTLS policy:<br> |
| |
| 3. Run the gRPC telemetry server on the BMC:<br> |
| |
| ```sh |
| root@bmc:~# LD_LIBRARY_PATH=/tmp /tmp/streaming_telemetry_server \ |
| --cert /tmp/test-server-cert.pem \ |
| --key /tmp/test-server-key.pem \ |
| --cacert /tmp/test-cacert.pem \ |
| --crls /tmp/crls \ |
| --policy /tmp/test-mtls.policy \ |
| --config /tmp/streaming_telemetry_server_config.textproto \ |
| --emconfig /usr/share/entity-manager/configurations/BMCBoard.json,/usr/share/entity-manager/configurations/MotherBoard.json \ |
| --port 50051 |
| ``` |
| |
| 4. Run the gRPC client from the Linux workstation:<br> |
| |
| ```sh |
| user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client \ |
| --cert test-client-cert.pem \ |
| --key test-client-key.pem \ |
| --cacert test-cacert.pem \ |
| --server_dns target_bmc_dns_name \ |
| --port 50051 |
| ``` |
| |
| ## Release notes |
| |
| ### Version 0.1.0, Sep 19 2024 |
| |
| 1. Features |
| |
| - Support subscribing by xpath-like path with wildcard syntax. |
| - Support subscribing by server configuration name. |
| |
| 1. Fixes |