# The Voyager Streaming Telemetry Server
## Table of Contents
- [Introduction](#introduction)
- [Why not use Redfish SSE for telemetry](#why-not-use-redfish-sse-for-telemetry)
  - [1. Millisecond-level telemetry sampling performance](#1-millisecond-level-telemetry-sampling-performance)
  - [2. A unified, direct gRPC telemetry interface without HTTP/Redfish indirection](#2-a-unified-direct-grpc-telemetry-interface-without-httpredfish-indirection)
  - [3. Simplified telemetry collector logic](#3-simplified-telemetry-collector-logic)
  - [4. Threshold-based dynamic sampling rate](#4-threshold-based-dynamic-sampling-rate)
- [Components Diagram](#components-diagram)
- [Subscription Sequence Diagram](#subscription-sequence-diagram)
- [Build for an ASPEED 2600 BMC board](#build-for-an-aspeed-2600-bmc-board)
- [Run on an ASPEED 2600 BMC board](#run-on-an-aspeed-2600-bmc-board)
## Introduction
This is a standalone gRPC server running on all arena management nodes
(Baseboard Management Controllers, or BMCs) of a data center machine. Its
expected clients are telemetry collectors running in the cloud, as shown in the
diagram below (or anywhere else on the network with valid credentials):
![System Diagram](docs/images/system_diagram.jpg)
The collector only depends on the
[Voyager Telemetry gRPC proto](streaming_telemetry/proto/voyager_telemetry.proto)
definition to interact with this server (the server's northbound API). The server
polls telemetry sources when clients subscribe.
It defines a generic "TelemetrySource"
[API](streaming_telemetry/src/telemetry_source_manager/README.md#southbound-api)
for adding new types of sources into its management (the server's southbound API).
There is no explicit dependency on existing [OpenBMC](https://www.openbmc.org/)
services on the arena management node; for example, you can implement I2C sensors
as a type of source by reading I2C sysfs files directly, and/or use
[MCTP sockets](https://github.com/openbmc/docs/blob/master/designs/mctp/mctp-kernel.md)
for sources from host CPUs.
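For illustration only, here is a minimal sketch of what such a source could look
like as a Rust trait. The trait name, method signatures, and types below are
assumptions made for this sketch, not the actual southbound API; see the linked
README for the real definition.
```rust
use std::time::SystemTime;

// Hypothetical sketch only -- the real southbound API is defined in
// streaming_telemetry/src/telemetry_source_manager/README.md.
/// A telemetry source the server can poll on demand.
pub trait TelemetrySource {
    /// Stable identifier of the source, e.g. a Redfish @odata.id.
    fn id(&self) -> &str;
    /// Poll the underlying hardware (I2C sysfs file, MCTP socket, ...) once and
    /// return the sampled value together with its sampling timestamp.
    fn sample(&mut self) -> std::io::Result<(f64, SystemTime)>;
}
```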
## Why not use Redfish SSE for telemetry
OpenBMC uses
[Server-Sent Events (SSE)](https://github.com/openbmc/docs/blob/master/designs/redfish-eventservice.md)
for server-pushed events, which can be used for machine telemetry. However, in
our real-world use cases, especially with the rise of ML hardware usage in data
centers, we found that this interface doesn't optimally satisfy our
requirements. Here's why:
### 1. Millisecond-level telemetry sampling performance
The OpenBMC solution relies on existing telemetry source services, such as the
[PSUSensor systemd service](https://github.com/openbmc/dbus-sensors/blob/master/src/PSUSensor.hpp#L62),
to poll the hardware and provide telemetry data via D-Bus object property change
events. This approach has several limitations:
- System-wide telemetry throughput is constrained by D-Bus IPC bandwidth, which
is significantly lower than what real hardware sampling can achieve. Hardware
sources like I2C sensors can typically be polled at 0.5ms to 5ms intervals.
For a system with 100 I2C sensors, that's potentially 20,000 telemetry samples
per second, while D-Bus system-wide bandwidth is about one order of magnitude
lower.
- Modeling each sensor as a D-Bus object and exposing its value changes through
D-Bus property changes loses the original sampling timestamp information, thus
reducing the fidelity of the data in high-resolution use cases.
- The current approach lacks flexibility in polling rates. Ideally, we should be
able to poll some sources at high resolution while keeping the majority at low
resolution, considering the limited computing resources of a typical BMC card.
Our new telemetry server addresses these issues by providing dynamically
adjustable, millisecond-level telemetry sampling, instead of the
[fixed polling rate at the second level](https://github.com/openbmc/dbus-sensors/blob/master/src/PSUSensorMain.cpp#L534C51-L534C79)
used in OpenBMC.
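As a rough illustration of why direct polling can reach this resolution, the
sketch below reads an hwmon sysfs file in a tight loop and timestamps each sample
at read time; the sysfs path and the 5 ms interval are placeholders, not part of
this repository.
```rust
use std::{fs, thread, time::{Duration, SystemTime}};

fn main() -> std::io::Result<()> {
    // Placeholder hwmon path; real paths depend on the board's device tree.
    let path = "/sys/class/hwmon/hwmon0/temp1_input";
    loop {
        // The timestamp is captured at sampling time, so it is not lost the
        // way it would be behind a D-Bus property-change event.
        let ts = SystemTime::now();
        let raw = fs::read_to_string(path)?;
        let millidegrees: i64 = raw.trim().parse().unwrap_or_default();
        println!("{ts:?} {millidegrees} millidegrees C");
        // ~200 samples/s per sensor at a 5 ms interval.
        thread::sleep(Duration::from_millis(5));
    }
}
```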
### 2. A unified, direct gRPC telemetry interface without HTTP/Redfish indirection
gRPC naturally provides a streaming interface for telemetry. By removing the
assumption that telemetry must go through the Redfish interface, we can
eliminate layers of indirection and improve performance:
- Remove the HTTP request/response layer between cloud collectors and BMC.
- Eliminate D-Bus IPC (including message marshaling/unmarshaling) between BMC
telemetry source services and the BMC telemetry server.
We still use the Redfish specification as our data model, ensuring compatibility
with existing telemetry data consumers.
### 3. Simplified telemetry collector logic
The current Redfish-based BMC telemetry solution requires multiple transactions
between client and server to select the desired telemetry sources. It also
requires additional rules and optimization effort to collect telemetry
efficiently. Our new telemetry server simplifies this process by:
- Providing an xpath-like
[resource addressing syntax](streaming_telemetry/proto/voyager_telemetry.proto#100)
for efficient source selection.
- Offering optimized,
[statically configured collection](streaming_telemetry/proto/voyager_server_config.proto#39)
as the default path.
This allows a collector client to use the xpath-like syntax or a simple
[server configuration name](yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto#3)
to subscribe to all telemetry sources of interest, with the desired parameters,
in a single gRPC call.
### 4. Threshold-based dynamic sampling rate
For a more meaningful telemetry solution, we want data collection to focus on
sources that need more attention. We achieve this through
[threshold-controlled](streaming_telemetry/proto/voyager_telemetry.proto#197)
telemetry sampling rates:
- The sampling rate of each telemetry source can vary based on its current
value.
- This can be configured either through a
[static server configuration](streaming_telemetry/proto/voyager_server_config.proto)
or via a new threshold configuration passed in the client subscription
request.
This approach ensures that critical data is collected more frequently when
needed, while conserving resources during normal operation.
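As an illustration only, written in the same style as the Rust request examples
later in this document: the message and field names below are assumptions about
what a threshold entry could look like, not the actual schema in
voyager_server_config.proto or voyager_telemetry.proto.
```rust
// Hypothetical threshold entry: poll slowly while the value stays in a normal
// range, and much faster once it crosses into a range that needs attention.
ThresholdConfig {
    source: "/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach".into(),
    ranges: vec![
        // Suspiciously low fan speed: sample every 5 ms.
        SamplingRange { upper_bound: 10_000.0, interval_ms: 5 },
        // Normal operation: sample once per second.
        SamplingRange { upper_bound: f64::MAX, interval_ms: 1_000 },
    ],
    ..Default::default()
}
```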
## Components Diagram
The core concept of this telemetry server is the Telemetry Source Manager. For
more details, please see the
[Telemetry Source Manager README](streaming_telemetry/src/telemetry_source_manager/README.md).
![Components Diagram](docs/images/components_diagram.jpg)
## Subscription Sequence Diagram
A subscribe call from a collector client creates a bi-directional stream:
![Subscription Sequence Diagram](docs/images/subscription_sequence_diagram.jpg)
A client can now subscribe to a set of telemetry sources using one of the
following [request](streaming_telemetry/proto/voyager_telemetry.proto#313) forms:
1. **Server configuration name:**
```rust
TelemetryRequest {
req_id: "req_repairability".into(),
req_config_group: "repairability_basic_cfg_group".into(),
..Default::default()
}
```
2. **XPath-like query:**
```rust
Fqp {
specifier: "/redfish/v1/Chassis/{ChassisId}/Sensors/{SensorId}".into(),
identifiers: HashMap::from([
("ChassisId".into(), "*".into()),
("SensorId".into(), "*".into()),
]),
r#type: FqpType::NotSet as i32,
..Default::default()
}
```
3. **Redfish data type:**
```rust
Fqp {
specifier: "#Sensor.v1_2_0.Sensor".into(),
r#type: FqpType::RedfishResource as i32,
..Default::default()
}
```
Each response message in the stream is an
[Update](streaming_telemetry/proto/voyager_telemetry.proto#156) message
containing a vector of
[Datapoint](streaming_telemetry/proto/voyager_telemetry.proto#137)s, each
representing a telemetry sample from a source:
```sh
0th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.411
1th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.417
...
45th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.687
46th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11565, timestamp: 2024:10:09:02:20:56.695
47th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.704
48th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.710
...
79th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.903
80th datapoint, @odata.id: /redfish/v1/Chassis/ChassisTwo/Sensors/fantach_fan5_tach, value: 11430, timestamp: 2024:10:09:02:20:56.909
```
This time series captures telemetry source value changes with millisecond
precision.
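As a hedged sketch of the client side, assuming a tonic-generated stub named
`VoyagerTelemetryClient` with a `subscribe` method for the bi-directional stream
(the actual generated names may differ), subscribing by configuration group and
draining the update stream could look roughly like this:
```rust
// Hypothetical client sketch; the stub, method, and field names here are
// assumptions based on the proto, not verified against this repository.
async fn subscribe(addr: String) -> Result<(), Box<dyn std::error::Error>> {
    let mut client = VoyagerTelemetryClient::connect(addr).await?;
    let request = TelemetryRequest {
        req_id: "req_repairability".into(),
        req_config_group: "repairability_basic_cfg_group".into(),
        ..Default::default()
    };
    // Bi-directional stream: send one request, then read Update messages.
    let outbound = tokio_stream::iter(vec![request]);
    let mut inbound = client.subscribe(outbound).await?.into_inner();
    while let Some(update) = inbound.message().await? {
        for dp in update.datapoints {
            println!("{} value: {:?}", dp.odata_id, dp.value);
        }
    }
    Ok(())
}
```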
## Build for an ASPEED 2600 BMC board
Download the code to your Linux workstation:
```sh
~/workspace$ git clone https://github.com/google/streaming-telemetry-server -b main; cd streaming-telemetry-server/streaming_telemetry
```
Install the Rust [cross](https://kerkour.com/rust-cross-compilation) build tool
if you have not already; it requires Docker to be installed first.
```sh
~$ cargo install -f cross
```
Build the target for the ASPEED 2600 SoC:
```sh
~/workspace/streaming-telemetry-server/streaming_telemetry$ cross build --no-default-features --release --target armv7-unknown-linux-gnueabihf
~/workspace/streaming-telemetry-server/streaming_telemetry$ file ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server
../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=8b7ef3cef9da4c72110cb1b12c9bad135c2b2c60, with debug_info, not stripped
```
## Run on an ASPEED 2600 BMC board
1. Copy the built binary and server configuration to the target BMC board<br>
```sh
~/workspace/streaming-telemetry-server/streaming_telemetry$ sshpass -p 0penBmc scp ../target/armv7-unknown-linux-gnueabihf/release/streaming_telemetry_server ../yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto root@bmc:/tmp/
```
2. Run the gRPC telemetry server<br>
```sh
root@bmc:~# /tmp/streaming_telemetry_server --port 50051 --insecure --config /tmp/streaming_telemetry_server_config.textproto --emconfig /usr/share/entity-manager/configurations/BMCBoard.json,/usr/share/entity-manager/configurations/MotherBoard.json
Running in insecure mode on port 50051...
```
3. Set up an SSH tunnel from the Linux workstation to the target BMC<br>
```sh
user1@workstation:~$ sshpass -p 0penBmc ssh -L 50051:localhost:50051 root@bmc
```
4. Build and run the test gRPC client from the Linux workstation<br>
```sh
user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ cargo build --release --features=build-client
user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client --port 50051 --insecure
```
### Test telemetry with threshold
Assume we have a threshold configuration entry defined for the sensor
["fantach_fan4_tach"](yocto/meta-my-machine/recipes-google/streaming-telemetry-server-systemd/files/tmp/streaming_telemetry_server_config.textproto#133) in the server configuration.
<br>
Run the same command as above, but only monitor the fan tach sensor
"fantach_fan4_tach":<br>
```sh
user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client --port 50051 --insecure | grep fantach_fan4_tach
```
Then adjust the sensor value manually from the BMC console; the sensor polling
rate will vary based on the new sensor value (depending on which range of its
threshold configuration it falls into):<br>
```sh
# Ensure the Fan Zone modes are all "Manual" or "Disabled"
root@bmc:~# curl localhost/redfish/v1/Managers/bmc#/Oem/OpenBmc/Fan/FanZones/Zone_0 | grep FanMode
# If not, disable PID control service:
root@bmc:~# systemctl stop phosphor-pid-control
# Get basis value
root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach
root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fanpwm_fan4_pwm
# Change fanpwm_fan4_pwm
root@bmc:~# curl -X PATCH http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fanpwm_fan4_pwm -d '{"Reading": 30.0}'
# Read back fan tach to confirm
root@bmc:~# curl http://localhost/redfish/v1/Chassis/ChassisOne/Sensors/fantach_fan4_tach
```
### Run with mTLS scheme
1. Build from yocto:<br> mTLS related dependencies can only be built from yocto
at this moment.<br> To build from yocto, use the bitbake recipes under yocto
folder.<br>
```sh
user1@workstation:/var/bmc/build/my_machine$ bitbake -c compile streaming-telemetry-server
user1@workstation:/var/bmc/build/my_machine$ sshpass -p 0penBmc scp ../../meta-my_machine/recipes-google/streaming-telemetry-server-systemd/files/streaming_telemetry_server_config.textproto root@bmc:/tmp/
user1@workstation:/var/bmc/build/my_machine$ sshpass -p 0penBmc scp tmp/work/armv7ahf-vfpv3d16-openbmc-linux-gnueabi/streaming-telemetry-server/0.1.0/build/target/armv7-openbmc-linux-gnueabihf/release/telemetry_server root@bmc:/tmp/
```
2. Prepare test keys and test mTLS policy:<br>
3. Run the gRPC telemetry server on the BMC:<br>
```sh
root@bmc:~# LD_LIBRARY_PATH=/tmp /tmp/streaming_telemetry_server \
--cert /tmp/test-server-cert.pem \
--key /tmp/test-server-key.pem \
--cacert /tmp/test-cacert.pem \
--crls /tmp/crls \
--policy /tmp/test-mtls.policy \
--config /tmp/streaming_telemetry_server_config.textproto \
--emconfig /usr/share/entity-manager/configurations/BMCBoard.json,/usr/share/entity-manager/configurations/MotherBoard.json \
--port 50051
```
4. Run the gRPC client from the Linux workstation:<br>
```sh
user1@workstation:~/workspace/streaming-telemetry-server/streaming_telemetry$ ../target/release/streaming_telemetry_client \
--cert test-client-cert.pem \
--key test-client-key.pem \
--cacert test-cacert.pem \
--server_dns target_bmc_dns_name \
--port 50051
```
## Release note
### Version 0.1.0, Sep 19 2024
1. Features
   - Support subscribing by xpath with wildcard syntax.
   - Support subscribing by server configuration name.
1. Fixes