Adding update shmem support for power smoothing V2 properties

fixes jira https://jirasw.nvidia.com/browse/DGXOPENBMC-21509

Signed-off-by: Raghul Rajakumar <rrajakumar@nvidia.com>
README.md

nsmd - Nvidia System Management Daemon

How to build

Install dependencies

sudo apt install build-essential gcc-13 g++-13 python3-dev nlohmann-json3-dev
pip install --user meson ninja

Install Boost

sudo apt install libboost1.83-all-dev # for Ubuntu 22.04

or

sudo apt install libboost1.84-all-dev # for Ubuntu 24.04

or, if it is not available from your package manager, download and install it from source.

wget https://downloads.sourceforge.net/project/boost/boost/1.84.0/boost_1_84_0.tar.gz
tar -xzf boost_1_84_0.tar.gz
cd boost_1_84_0
./bootstrap.sh --prefix=/usr/local
./b2 install

Copy libmctp header for local development

git archive --remote=ssh://git@gitlab-master.nvidia.com:12051/dgx/bmc/libmctp.git develop libmctp-externals.h | tar -x -C common/

Configure and build with Meson

# Configure Meson build with debug options and compiler flags (copied from openbmc-build-scripts repo)
meson setup --reconfigure -Db_sanitize=address,undefined -Db_lundef=true -Dwerror=true -Dwarning_level=3 -Db_colorout=never -Ddebug=true -Doptimization=g -Dcpp_args="-DBOOST_USE_VALGRIND -Wno-error=invalid-constexpr -Wno-invalid-constexpr -Werror=uninitialized -Wno-error=maybe-uninitialized -Werror=strict-aliasing" builddir
# Build all targets
ninja -C builddir

Build and run unit tests

# Run all unit tests
meson test -C builddir
# Run specific unit test
meson test -C builddir nsmChassis_test

Troubleshooting Build Issues

sdbusplus Version Mismatch

If you encounter sdbusplus build errors, verify that the revision in subprojects/sdbusplus.wrap matches the version specified in the openbmc-build-scripts repository. Version mismatches can cause build failures.

Updating Subproject Dependencies

For other subproject-related errors, you can update all subproject repositories to their latest commits using:

cd subprojects

find -L . -type d -name ".git" | while read -r gitdir; do
    repo=$(dirname "$gitdir")
    echo "Pulling updates in $repo"
    cd "$repo"
    git pull
    cd - > /dev/null
done

Unit Tests Debugging

Debugging with GDB in console

# Debug all tests
meson test -C builddir --gdb

# Debug specific test
meson test -C builddir nsmChassis_test --gdb

Debugging with GDB in VSCode/Cursor

  1. Configure launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug file with Meson",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/builddir/${relativeFileDirname}/${fileBasenameNoExtension}",
            "cwd": "${workspaceFolder}/builddir/${relativeFileDirname}",
            "preLaunchTask": "Compile meson test"
        }
    ]
}
  2. Configure tasks.json
{
    "version": "2.0.0",
    "tasks": [
        {
            "label": "Compile meson test",
            "type": "shell",
            "command": "meson compile -C builddir ${fileBasenameNoExtension}",
            "group": "build"
        }
    ]
}
  3. Open the unit test file you want to debug in VSCode/Cursor
  4. Set breakpoints in the code where needed
  5. Press F5 to start debugging the test

Installing clang-format-19 for CI Usage

To keep code formatting consistent with the CI pipeline, install clang-format-19 on your system:

# Update the package list
sudo apt update

# Install clang-format-19
sudo apt install clang-format-19

Once installed, the clang-format-19 command is available for formatting code locally before it reaches the CI pipeline.

Using clang-format-19 for all changed files before commit

To automatically format your code before each commit, create a pre-commit hook with the following steps:

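# Quote the delimiter so that $(...) and $files are written to the hook literally
# instead of being expanded by the current shell
cat > .git/hooks/pre-commit << 'EOL'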
#!/bin/sh

# Get list of staged files that are C/C++ source files
files=$(git diff --cached --name-only --diff-filter=ACMR | grep ".*\.[ch]\(pp\)\?$")

if [ -n "$files" ]; then
    # Format the files
    clang-format-19 -i $files
    
    # Add the formatted files back to staging
    git add $files
    
    # Check if any files were modified after formatting
    if ! git diff --cached --quiet; then
        echo "Formatted C/C++ files were automatically fixed up"
    fi
fi

exit 0
EOL
chmod +x .git/hooks/pre-commit

Progress Counters

The NSM daemon tracks various sensor polling operations using progress counters. These counters are stored in a memory-mapped file descriptor (memfd) and can be accessed via D-Bus for dumping, monitoring, and debugging purposes.

Counter Types and When They Are Incremented

Each counter type tracks a specific aspect of sensor polling operations:

1. Priority

  • Description: Tracks successful updates of priority sensors
  • When incremented: After each successful priority sensor update during the priority polling phase (every 150ms)
  • Location: sensorManager.cpp::pollPrioritySensors()
  • Purpose: Monitor high-frequency critical sensor updates

2. GpuPerformanceMonitoring

  • Description: Tracks GPU Performance Monitoring (GPM) sensor updates
  • When incremented: After each successful GPM sensor update (NVDEC, NVJPG utilization metrics)
  • Polling interval: 1000ms
  • Location: nsmGpmOemFactory.cpp when creating GPM sensors
  • Purpose: Monitor GPU-specific performance metric collection

3. LongRunning

  • Description: Tracks completion of long-running sensor operations
  • When incremented: After a long-running sensor operation completes
  • Location: sensorManager.cpp::updateLongRunningSensor()
  • Purpose: Monitor operations that may take extended time and potentially return events as second responses (e.g., throttle duration sensors)

4. Static

  • Description: Tracks one-time static sensor updates
  • When incremented: After each static sensor update
  • Location: sensorManager.cpp::pollNonPrioritySensors() when pollingType == Static
  • Purpose: Monitor sensors with values that don't change during runtime (polled once and removed from queue upon success)

5. RoundRobin

  • Description: Tracks non-priority sensor updates in round-robin fashion
  • When incremented: After each non-priority sensor update during round-robin polling
  • Location: sensorManager.cpp::pollNonPrioritySensors() when pollingType == RoundRobin
  • Purpose: Monitor sensors polled in circular queue fashion when time permits after priority sensors

6. PriorityTimeExceeded

  • Description: Tracks when priority polling exceeds its time window
  • When incremented: When priority sensor polling takes longer than SENSOR_POLLING_TIME (typically 150ms)
  • Location: sensorManager.cpp::pollPrioritySensors() when (t1 - t0) > pollingTimeInUsec
  • Purpose: Detect performance issues where priority polling is taking too long and may affect system responsiveness
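
The time-window condition above can be sketched as follows. This is a minimal illustration, not the code in sensorManager.cpp: the function name `priorityTimeExceeded` is hypothetical, and the use of `std::chrono::steady_clock` for `t0` and `t1` is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Returns true when one priority polling pass took longer than its window.
// t0 and t1 mirror the names in the condition (t1 - t0) > pollingTimeInUsec;
// measuring them with steady_clock is an assumption of this sketch.
inline bool priorityTimeExceeded(std::chrono::steady_clock::time_point t0,
                                 std::chrono::steady_clock::time_point t1,
                                 uint64_t pollingTimeInUsec)
{
    assert(t1 >= t0); // the pass cannot end before it starts
    const auto elapsedUsec =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
            .count();
    return static_cast<uint64_t>(elapsedUsec) > pollingTimeInUsec;
}
```

With a 150ms window (150000 microseconds), a pass that takes 151ms would count as exceeded, while a pass that takes exactly 150ms would not.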

7. PostPatch

  • Description: Tracks post-patch I/O operations
  • When incremented: After each post-patch I/O operation on the device
  • Location: nsmDevice.cpp::postPatchIO()
  • Purpose: Monitor operations that occur after device firmware updates or patches to verify device state

8. Event

  • Description: Tracks NSM event processing
  • When incremented: After each NSM event is received and processed by the event dispatcher
  • Location: nsmEvent.cpp::DelegatingEventHandler::delegate()
  • Purpose: Monitor asynchronous notifications from devices (e.g., long-running operation completion, state changes)

9. Error

  • Description: Tracks failed operations (excluding timeouts)
  • When incremented: When any sensor update or operation fails with an error code other than NSM_SUCCESS or NSM_SW_ERROR_TIMEOUT
  • Location: progressCounters.cpp::increment() when rc != NSM_SUCCESS and rc != NSM_SW_ERROR_TIMEOUT
  • Purpose: Monitor general error conditions during polling operations

10. Timeout

  • Description: Tracks timeout errors
  • When incremented: When a sensor update or operation times out (NSM_SW_ERROR_TIMEOUT)
  • Location: progressCounters.cpp::increment() when rc == NSM_SW_ERROR_TIMEOUT
  • Purpose: Monitor operations where devices did not respond within the expected time window
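
The Error and Timeout rules above suggest that the return code decides which counter is bumped. The sketch below illustrates that routing under stated assumptions: `ProgressCountersSketch`, `kNsmSuccess`, and `kNsmTimeout` are hypothetical stand-ins (the real return codes come from the NSM headers), and incrementing the requested counter on success is an assumption rather than something this README confirms.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class ProgressCounterType
{
    Priority,
    GpuPerformanceMonitoring,
    LongRunning,
    Static,
    RoundRobin,
    PriorityTimeExceeded,
    PostPatch,
    Event,
    Error,
    Timeout,
    EnumCount,
};

// Placeholder values standing in for NSM_SUCCESS and NSM_SW_ERROR_TIMEOUT.
constexpr int kNsmSuccess = 0;
constexpr int kNsmTimeout = -1;

struct ProgressCountersSketch
{
    std::array<uint64_t, static_cast<std::size_t>(
                             ProgressCounterType::EnumCount)>
        counters{};

    uint64_t& at(ProgressCounterType type)
    {
        assert(type != ProgressCounterType::EnumCount);
        return counters[static_cast<std::size_t>(type)];
    }

    // Success bumps the requested counter (assumed), a timeout bumps
    // Timeout, and any other return code bumps Error.
    void increment(ProgressCounterType type, int rc)
    {
        if (rc == kNsmSuccess)
        {
            ++at(type);
        }
        else if (rc == kNsmTimeout)
        {
            ++at(ProgressCounterType::Timeout);
        }
        else
        {
            ++at(ProgressCounterType::Error);
        }
    }
};
```

Keeping Timeout separate from Error makes it possible to tell unresponsive devices apart from devices that respond with a failure.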

Configuration Options

Progress counters can be configured via meson options:

  • progressCounter: Enable/disable progress counter functionality (default: enabled)
  • sensor-progress-counters-dump-count-threshold: Number of counter updates before dumping to memfd (default: 100000)
  • sensor-progress-counters-dump-time-threshold: Time threshold in microseconds before dumping (default: 600000000 = 10 minutes)
  • sensor-progress-counters-memfd-size: Size of the memory-mapped file in bytes (default: 65536)
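
One way to read the two dump thresholds is that a dump happens as soon as either is reached. The sketch below shows that reading only; the either-or semantics, the `DumpThresholds` struct, and the `shouldDump` function are assumptions for illustration and are not taken from the nsmd sources.

```cpp
#include <cassert>
#include <cstdint>

// Defaults copied from the meson options listed above.
struct DumpThresholds
{
    uint64_t countThreshold = 100000;       // counter updates
    uint64_t timeThresholdUsec = 600000000; // microseconds
};

// 600000000 microseconds is 600 seconds, which is 10 minutes.
static_assert(600000000ULL / 1000000ULL / 60ULL == 10ULL,
              "default time threshold should be 10 minutes");

// Assumed policy: dump when either threshold has been reached.
inline bool shouldDump(const DumpThresholds& thresholds,
                       uint64_t updatesSinceDump, uint64_t nowUsec,
                       uint64_t lastDumpUsec)
{
    assert(nowUsec >= lastDumpUsec); // timestamps must not go backwards
    return updatesSinceDump >= thresholds.countThreshold ||
           (nowUsec - lastDumpUsec) >= thresholds.timeThresholdUsec;
}
```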

Accessing Counter Data

Counter data is exposed via D-Bus at:

/xyz/openbmc_project/progress_counters/<device_eid>

Use the nsmProgressCountersReader tool to read counter data:

# Read counters for all devices
nsmProgressCountersReader

# Read counters for specific device
nsmProgressCountersReader <device_eid>

Adding Support for a New Counter Type

To add a new progress counter type, follow these steps:

1. Update the Enum Definition

Add your new counter type to nsmd/nsmProgressCounters/progressCounterType.hpp:

enum class ProgressCounterType
{
    Priority,
    GpuPerformanceMonitoring,
    // ... existing counters ...
    YourNewCounter,  // Add new counters here; EnumCount must stay last
    EnumCount,
};

Important: Always add new counters before EnumCount, as EnumCount must remain the last entry for the CountersCount calculation.
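
The reason EnumCount has to stay last can be seen in a small sketch of how a count is typically derived from a sentinel enumerator. The definitions of `CountersCount` and `CountersArray` below are assumptions (including the `uint64_t` element type), the enum is shortened for illustration, and `indexOf` is a hypothetical helper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shortened enum for illustration; the real one lists every counter type.
enum class ProgressCounterType
{
    Priority,
    GpuPerformanceMonitoring,
    YourNewCounter, // new entries go before EnumCount
    EnumCount,      // sentinel, never used as a real counter
};

// Assumed definition: enumerators are numbered from zero, so the value of
// the last one equals the number of entries before it.
constexpr std::size_t CountersCount =
    static_cast<std::size_t>(ProgressCounterType::EnumCount);

// Assumed element type; one slot per counter.
using CountersArray = std::array<uint64_t, CountersCount>;

// Maps a counter type to its slot in CountersArray.
inline std::size_t indexOf(ProgressCounterType type)
{
    assert(type != ProgressCounterType::EnumCount);
    return static_cast<std::size_t>(type);
}
```

If a new entry were added after EnumCount, the count would not grow and the new entry's index would fall outside the array.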

2. Update Documentation

Add comprehensive documentation for your new counter in the nsmd/nsmProgressCounters/progressCounterType.hpp file:

/**
 * @brief Your new counter description
 *
 * Incremented when: Describe when this counter is incremented
 *
 * Location: File.cpp::functionName()
 */
YourNewCounter,

Add your new counter to the “Counter Types and When They Are Incremented” section in this README with:

  • Description
  • When it's incremented
  • Location in code
  • Purpose

3. Update Counter Names Map

Add your counter name to the counterNames array in nsmd/nsmProgressCounters/progressCounterReader.cpp:

static constexpr std::array<std::string_view, CountersCount> counterNames = {
    "Priority",  "GPM",        "LongRunning",
    "Static",    "RoundRobin", "PriorityTimeExceeded",
    "PostPatch", "Event",      "Error",
    "Timeout",   "YourNewCounter",  // Add your counter name here
};

Important: The order must match the enum order in ProgressCounterType. This array is used by nsmProgressCountersReader to display counter names in CSV output.
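
Note that initializing a `std::array` with fewer initializers than its size compiles without error and leaves the remaining elements empty, so a forgotten name would show up as a blank column header rather than a build failure. The sketch below adds a compile-time guard for that case; the `static_assert` and the `counterName` helper are suggestions for illustration, not part of the existing code, and `CountersCount` is defined here under the assumption that it is derived from EnumCount.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

enum class ProgressCounterType
{
    Priority,
    GpuPerformanceMonitoring,
    LongRunning,
    Static,
    RoundRobin,
    PriorityTimeExceeded,
    PostPatch,
    Event,
    Error,
    Timeout,
    EnumCount,
};

constexpr std::size_t CountersCount =
    static_cast<std::size_t>(ProgressCounterType::EnumCount);

constexpr std::array<std::string_view, CountersCount> counterNames = {
    "Priority",  "GPM",        "LongRunning",
    "Static",    "RoundRobin", "PriorityTimeExceeded",
    "PostPatch", "Event",      "Error",
    "Timeout",
};

// Fails to compile if the enum grew but the name list did not.
static_assert(!counterNames.back().empty(),
              "counterNames has fewer entries than ProgressCounterType");

// Looks up the display name by enum position, which is why the order of
// counterNames has to match the enum order.
inline std::string_view counterName(ProgressCounterType type)
{
    assert(type != ProgressCounterType::EnumCount);
    return counterNames[static_cast<std::size_t>(type)];
}
```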

4. Increment the Counter

In the appropriate location in your code, increment the counter:

// With a return code: failures are counted under Error or Timeout (see above)
nsmDevice->progressCounters.increment(ProgressCounterType::YourNewCounter, rc, timestamp);

// Or directly, without return code checking
nsmDevice->progressCounters.increment(ProgressCounterType::YourNewCounter, timestamp);

Data Structure

Counters are stored in a packed structure for efficient memory usage:

struct __attribute__((packed)) CounterDataRow
{
    uint32_t key;           // Iteration/dump key
    uint64_t timestamp;     // Timestamp in microseconds
    CountersArray counters; // Array of counter values
};

The data rotates in the memfd using key % maxRows to ensure bounded memory usage.
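
The rotation can be sketched as below. Several details are assumptions: ten `uint64_t` counters per row, a memfd that holds only rows (no header), and `maxRows` being derived from the memfd size. The `slotFor` helper is hypothetical. Under those assumptions a packed row would occupy 92 bytes, so the default 65536-byte memfd would hold 712 rows.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CountersCount = 10;                  // ten counter types
using CountersArray = std::array<uint64_t, CountersCount>; // assumed type

struct __attribute__((packed)) CounterDataRow
{
    uint32_t key;           // Iteration/dump key
    uint64_t timestamp;     // Timestamp in microseconds
    CountersArray counters; // Array of counter values
};

// Default of sensor-progress-counters-memfd-size.
constexpr std::size_t memfdSize = 65536;

// Assumed derivation: as many whole rows as fit in the memfd.
constexpr std::size_t maxRows = memfdSize / sizeof(CounterDataRow);

// Row slot for a given key; keys past maxRows wrap around and overwrite
// the oldest rows, which keeps memory usage bounded.
inline std::size_t slotFor(uint32_t key, std::size_t rows)
{
    assert(rows > 0);
    return key % rows;
}
```

For example, the row written with key equal to `maxRows` lands in slot 0 again, replacing the first row that was written.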

Artifacts

A successful build generates three binary artifacts:

  1. nsmd (NSM Daemon)
  2. nsmtool (NSM Requester utility)
  3. nsmMockupResponder (NSM Endpoint Mockup Responder)

nsmd

A daemon that discovers NSM endpoints, gathers telemetry data from them, and publishes it to D-Bus or similar IPC services for consumer services such as bmcweb.

nsmtool

nsmtool is a client tool that acts as an NSM requester and can be invoked from the BMC. It sends a request message, parses the response message, and displays it in a readable format.

nsmMockupResponder

A mockup NSM responder that can be used for development purposes. Its primary use is to test nsmd and nsmtool features on an emulator such as QEMU.

Follow these steps to run nsmMockupResponder:

Step 1: On the QEMU instance, restart the nsmd service.

Step 2: Assign an address to the loopback (lo) interface.

mctp addr add 12 dev lo

Step 3: Immediately start the mock responder using the assigned address.

nsmMockupResponder -v -d Baseboard -i 0 -e 12

Run Step 3 right after Step 2. If there is any delay, nsmd will fail to detect the endpoint. If detection fails, repeat all steps from the beginning.