RDMA_Network_Guide

why涛 | 2023-12-27 | openGauss

Identifying CX4/CX5 NICs

Run the following command:

lspci |grep Mellanox

Command output:

81:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
81:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
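
When several adapters are installed, it can help to reduce the output to the PCI address and the ConnectX generation. The following is a minimal sketch, assuming the bracketed model name appears as in the sample output above. The function name connectx_models is hypothetical.

```shell
# List "<PCI address> <ConnectX generation>" for each Mellanox function.
# Reads lspci output on stdin; lines without a bracketed ConnectX name are skipped.
connectx_models() {
    sed -n 's/^\([^ ]*\) .*Mellanox.*\[\(ConnectX-[^]]*\)\].*$/\1 \2/p'
}
```

On a real host, this could be used as: lspci | connectx_models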

Installing the MLNX Driver

  1. Download the driver package that matches the OS from https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/.

  2. Create a directory and mount the OS image file to this directory. Replace the OS image name with the actual one.

    mkdir -p /mnt/iso
    mount openEuler-22.03-LTS-x86_64-dvd.iso /mnt/iso
  3. Configure the OS image source, for example, the local image, to obtain dependencies required during the installation.

    1. Open the image source file.

      vim /etc/yum.repos.d/openEuler.repo
    2. Press i to enter the insert mode and retain only the following content:

      [OS]
      name=OS
      baseurl=file:///mnt/iso
      enabled=1
      gpgcheck=0
    3. Press Esc, type :wq!, and press Enter to save the file and exit.

    4. Cache the software package.

      yum makecache
  4. Upload the driver package to the server and decompress it. Replace the driver package name with the actual one.

    tar -zxvf MLNX_OFED_LINUX-5.4-3.7.5.0-openeuler22.03-x86_64.tgz
  5. Go to the extracted driver package directory and run the following command to install the driver:

    ./mlnxofedinstall --without-depcheck --without-fw-update --force

    If the system displays a message indicating that the kernel does not support the driver version, run the following command:

    ./mlnxofedinstall --add-kernel-support
  6. Configure the system to automatically start the driver upon system restart.

    chkconfig --add openibd
    /etc/init.d/openibd start
    chkconfig openibd on
  7. Reboot the server after the installation is complete.
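
For scripted installations, the repo file edited in vim during the image source configuration could instead be written non-interactively. The following is a minimal sketch, not part of the original procedure. The function name write_local_repo is hypothetical, and it overwrites the target file so that only the [OS] section remains.

```shell
# Write a local-ISO repo definition without opening an editor.
# $1: repo file path (for example /etc/yum.repos.d/openEuler.repo)
# $2: directory where the OS image is mounted (for example /mnt/iso)
write_local_repo() {
    cat > "$1" <<EOF
[OS]
name=OS
baseurl=file://$2
enabled=1
gpgcheck=0
EOF
}
```

Run as root, this could be used as: write_local_repo /etc/yum.repos.d/openEuler.repo /mnt/iso && yum makecache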

Verifying the Installation

  1. Check the RoCE LAG function of the driver.

    1. Check whether the RoCE LAG function is enabled.

      find /sys/ -name roce_lag_enable | xargs cat
      • If the command output is 1, the function is enabled.
      • If the command output is 0 or no command output is displayed, the function is disabled.
      • The function is expected to be disabled. If it is enabled, go to substep 2 to disable it.
    2. Disable the RoCE LAG function.

      sed '/load_module mlx5_core/a\ files=`find /sys -name roce_lag_enable`;for file in $files;do echo 0 > $file;done' -i /etc/init.d/openibd
    3. Reboot the node to apply the modification. Then, repeat substep 1 to check whether the modification has taken effect.

      reboot
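
Because the find command prints only the values, it is not obvious which port a value belongs to. The following is a minimal sketch that prints each roce_lag_enable file together with its value and reports failure if any is still 1. The function name check_roce_lag is hypothetical.

```shell
# Print each roce_lag_enable file under a root directory with its value.
# Returns non-zero if any of them is still set to 1.
# $1: directory to search (defaults to /sys)
check_roce_lag() {
    root=${1:-/sys}
    status=0
    for f in $(find "$root" -name roce_lag_enable 2>/dev/null); do
        value=$(cat "$f")
        echo "$f: $value"
        [ "$value" = "1" ] && status=1
    done
    return "$status"
}
```

After the reboot, this could be used as: check_roce_lag && echo "RoCE LAG is disabled"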
  2. Query the driver version.

    ofed_info -s

    If the queried driver version matches the version of the package installed in "Installing the MLNX Driver", the driver version is correct.
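
The version comparison can also be scripted. The following is a minimal sketch, assuming that both the ofed_info -s output and the driver package name contain the version in a form such as 5.4-3.7.5.0. The function names are hypothetical.

```shell
# Extract an OFED version such as 5.4-3.7.5.0 from a string.
ofed_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+-[0-9]+(\.[0-9]+)+' | head -n 1
}

# Succeed when the installed version equals the version in the package name.
# $1: output of ofed_info -s; $2: driver package file name
ofed_version_matches() {
    installed=$(ofed_version "$1")
    [ -n "$installed" ] && [ "$installed" = "$(ofed_version "$2")" ]
}
```

On a real host, this could be used as: ofed_version_matches "$(ofed_info -s)" MLNX_OFED_LINUX-5.4-3.7.5.0-openeuler22.03-x86_64.tgz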

  3. Load the MST tool.

    mst start

    If the following information is displayed, the loading is successful.

    Starting MST (Mellanox Software Tools) driver set
    Loading MST PCI module - Success
    Loading MST PCI configuration module - Success
    Create devices
    Unloading MST PCI module (unused) - Success
  4. Query the device path and network port.

    1. Query the device paths of RoCE and IB cards.

      mst status

      Command output:

      MST modules:
      ------------
          MST PCI module is not loaded
          MST PCI configuration module loaded
      
      MST devices:
      ------------
      /dev/mst/mt4119_pciconf0         - PCI configuration cycles access.
                                         domain:bus:dev.fn=0000:81:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                         Chip revision is: 00

      Each device path of the form /dev/mst/mst_typeN (N can be 0, 1, 2, ...) listed under MST devices corresponds to a CX card. For details about the mapping between mst_type and CX NIC models, see Table 1.

      Table 1 Mapping between mst_type and CX NIC models

      mst_type          NIC Model
      mt4099_pci_cr     CX3
      mt4117_pciconf    CX4-Lx
      mt4119_pciconf    CX5
      mt4123_pciconf    CX6

    2. Query the network ports to be checked. Subsequent steps will check all the queried ports.

      ll /dev/mst

      Ports mt4119_pciconf0 and mt4119_pciconf0.1 on the current node will be checked.
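
The mapping in Table 1 can be applied to the entries under /dev/mst with a small lookup. The following is a minimal sketch that covers only the four device types listed in the table. The function name mst_nic_model is hypothetical.

```shell
# Map an MST device name (or a full /dev/mst path) to a NIC model using Table 1.
mst_nic_model() {
    case "${1##*/}" in
        mt4099_pci_cr*)  echo "CX3" ;;
        mt4117_pciconf*) echo "CX4-Lx" ;;
        mt4119_pciconf*) echo "CX5" ;;
        mt4123_pciconf*) echo "CX6" ;;
        *)               echo "unknown" ;;
    esac
}
```

On a real host, this could be used as: for d in /dev/mst/*; do echo "$d -> $(mst_nic_model "$d")"; done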

  5. Check the firmware version.

    1. Query the firmware version of the RoCE or IB card. In the command, /dev/mst/mt4119_pciconf0 is the device path queried in the previous step. Replace it as required.

      flint -d /dev/mst/mt4119_pciconf0 q

      The command output is as follows:

      Image type:            FS4
      FW Version:            16.31.2006
      FW Release Date:       31.8.2021
      Product Version:       16.31.2006
      Rom Info:              type=UEFI version=14.24.15 cpu=AMD64
                             type=PXE version=3.6.404 cpu=AMD64
      Description:           UID                GuidsNumber
      Base GUID:             ec0d9a0300c152e4        8
      Base MAC:              ec0d9ac152e4            8
      Image VSD:             N/A
      Device VSD:            N/A
      PSID:                  MT_0000000012
      Security Attributes:   N/A
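
When the firmware version has to be collected from several ports, the FW Version field can be extracted from the query output. The following is a minimal sketch, assuming the field layout shown in the sample output above. The function name flint_fw_version is hypothetical.

```shell
# Print the value of the "FW Version" field from flint query output read on stdin.
flint_fw_version() {
    awk -F: '/^[[:space:]]*FW Version:/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}
```

On a real host, this could be used as: flint -d /dev/mst/mt4119_pciconf0 q | flint_fw_version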
  6. Check the firmware network protocol.

    1. Query the current network protocol. The ETH protocol is used as an example.

      ibdev2netdev -v

      • If the NIC name prefix is ib, the current network protocol is IB. Go to substep 2.
      • If the NIC name prefix is en, the current network protocol is ETH. Go to step 7.
    2. Query the values of LINK_TYPE_P1 and LINK_TYPE_P2. The following uses /dev/mst/mt4123_pciconf0 as an example.

      mlxconfig -d /dev/mst/mt4123_pciconf0 q|grep LINK_TYPE_P1
      mlxconfig -d /dev/mst/mt4123_pciconf0 q|grep LINK_TYPE_P2
      • If the command output is empty, the network protocol cannot be changed in the current environment. In this case, use a different environment.
      • If a query result is displayed, the network protocol can be modified.
        • The queried values are expected to be ETH(2). If so, go to step 7.
        • If the queried values are IB(1), go to substep 3.

    3. Change the values of LINK_TYPE_P1 and LINK_TYPE_P2. The following uses /dev/mst/mt4123_pciconf0 as an example.

      mlxconfig -d /dev/mst/mt4123_pciconf0 s LINK_TYPE_P1=2
      mlxconfig -d /dev/mst/mt4123_pciconf0 s LINK_TYPE_P2=2

    4. Reboot the system, and then repeat substep 2 to verify that the modification is successful.
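
The three possible outcomes of the LINK_TYPE query (ETH(2), IB(1), or empty output) can be told apart in a script. The following is a minimal sketch, assuming the queried value appears in the output line as ETH(2) or IB(1) as described above. The function name link_type_state is hypothetical.

```shell
# Classify one LINK_TYPE line of mlxconfig query output read on stdin.
# Prints "eth", "ib", "unavailable" (empty input), or "unknown".
link_type_state() {
    line=$(cat)
    case "$line" in
        *"ETH(2)"*) echo "eth" ;;
        *"IB(1)"*)  echo "ib" ;;
        "")         echo "unavailable" ;;
        *)          echo "unknown" ;;
    esac
}
```

On a real host, this could be used as: mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep LINK_TYPE_P1 | link_type_state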

  7. Verify the RDMA network.

    Run the following command on the server node:

    ib_send_bw -d mlx5_1

    Run the following command on the client node (xx.xx.xx.xx indicates the IP address of the server node):

    ib_send_bw -d mlx5_1 xx.xx.xx.xx
  8. (Optional) Set firmware options.

    NOTE: This step is recommended because it reduces network latency.

    1. Query the value of the CX card firmware option PCI_WR_ORDERING.

      Take /dev/mst/mt4119_pciconf0 as an example. Query the firmware settings of the two ports of the device. In the query result, the value of PCI_WR_ORDERING is expected to be 1. If not, go to substep 2.

      mlxconfig -d /dev/mst/mt4119_pciconf0 q | grep PCI_WR_ORDERING
      mlxconfig -d /dev/mst/mt4119_pciconf0.1 q | grep PCI_WR_ORDERING

    2. Set the firmware option PCI_WR_ORDERING for the two ports of a CX5 card, and then reboot the system. After the system restarts, repeat substep 1 to check whether the modification is successful.

      mlxconfig -y -d /dev/mst/mt4119_pciconf0 s PCI_WR_ORDERING=1

      mlxconfig -y -d /dev/mst/mt4119_pciconf0.1 s PCI_WR_ORDERING=1

Configuring NIC IP Addresses

  1. View the association between Ethernet devices and IB devices/ports.

    ibdev2netdev -v
    • In this example, the IB device mlx5_0 on the current node is associated with the NIC enp24s0f0.
    • In this example, the IB device mlx5_1 on the current node is associated with the NIC enp24s0f1.
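
To use the association in later commands, the device and NIC names can be extracted as pairs. The following is a minimal sketch, assuming each mapping line names the device as mlxN_M and places the NIC name right after "==>". The exact ibdev2netdev output format is an assumption here, and the function name ibdev_pairs is hypothetical.

```shell
# Print "<ib_device> <nic_name>" pairs from ibdev2netdev output read on stdin.
# Lines that lack a device token or a "==>" marker are ignored.
ibdev_pairs() {
    awk '{
        dev = ""; nic = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^mlx[0-9]+_[0-9]+$/) dev = $i
            if ($i == "==>" && i < NF) nic = $(i + 1)
        }
        if (dev != "" && nic != "") print dev, nic
    }'
}
```

On a real host, this could be used as: ibdev2netdev -v | ibdev_pairs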

  2. Check the NIC status.

    ifconfig -a

    If the output for the NIC contains the following four items, the NIC can be used properly.

    • UP indicates that the NIC is enabled.
    • RUNNING indicates that the network cable of the NIC is connected.
    • MULTICAST indicates that multicasting is supported.
    • MTU 1500 indicates that the maximum transmission unit is 1500 bytes.
  3. Configure the NIC IP address based on your environment. For example, add the NIC IP address to the /etc/sysconfig/network-scripts/ifcfg-enp24s0f0 configuration file, and then run systemctl restart network.service to restart the network service.

    After the configuration is complete, check the NIC status again as described in step 2.
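
The guide does not list the contents of the ifcfg file. The following is a minimal sketch of a static IPv4 configuration, written as a helper so that it can be reused for each port. The function name write_ifcfg is hypothetical, the address 192.0.2.10/24 in the usage line is a placeholder, and the set of keys is an assumption that may need to be extended for your environment.

```shell
# Write a minimal static-IP ifcfg file.
# $1: output file, $2: interface name, $3: IPv4 address, $4: prefix length
write_ifcfg() {
    cat > "$1" <<EOF
TYPE=Ethernet
BOOTPROTO=static
NAME=$2
DEVICE=$2
ONBOOT=yes
IPADDR=$3
PREFIX=$4
EOF
}
```

Run as root, this could be used as: write_ifcfg /etc/sysconfig/network-scripts/ifcfg-enp24s0f0 enp24s0f0 192.0.2.10 24 && systemctl restart network.service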

Common IB Commands

Table 1 Common IB commands

Command                                     Description
------------------------------------------  -----------
lspci |grep Mell                            Checks whether an IB card exists on the host (by searching for the vendor name Mellanox).
ibstatus                                    Displays IB card information, including the link status, port rate, and port GUID.
ibstat                                      Provides functions similar to those of ibstatus.
ofed_info -s                                Queries the version of the installed driver.
ibv_devinfo                                 Queries the IB device information on the current node.
ibqueryerrors -C mlx4_0 -P 1                Queries the statistics of each port on the current IB network.
perfquery                                   Queries whether packet loss or symbol errors occur on the IB card port.
ibv_devices                                 Queries the IB cards on the current node.
ibdump                                      Captures packets at the IB layer. It is provided by Mellanox.
ethtool --set-priv-flags eth-s0 sniffer on  Enables the sniffer function so that tcpdump can be used to capture packets.
ib_atomic_bw                                Calculates the bandwidth of RDMA atomic transactions between a pair of machines (one server and one client). It samples the CPU clock to obtain the time at which complete messages are received and calculates the bandwidth from it. It supports two-way tests and allows you to change the MTU size, TX size, number of iterations, and message size. For more usage, see the -a parameter.
ib_atomic_lat                               Calculates the latency of RDMA atomic transactions between a pair of machines for a given message size. The client sends RDMA atomic operations to the server, samples the CPU clock to obtain the time when all the messages are received, and calculates the latency.
ib_read_bw                                  Calculates the bandwidth of RDMA read operations between a pair of machines.
ib_read_lat                                 Calculates the latency of RDMA read operations between a pair of machines for a given message size.
ib_send_bw -d mlx5_1                        Calculates the bandwidth of RDMA send operations between a pair of machines.
ib_send_lat                                 Calculates the latency of RDMA send operations between a pair of machines for a given message size.
ib_write_bw                                 Calculates the bandwidth of RDMA write operations between a pair of machines.
ib_write_lat                                Calculates the latency of RDMA write operations between a pair of machines for a given message size.
raw_ethernet_bw                             Calculates the send bandwidth between a pair of machines.
raw_ethernet_lat                            Calculates the latency of sending messages of a given size between a pair of machines.
rping                                       Checks whether the RDMA CM connection is normal.