[ovs-dev] [PATCH] dpif-netdev: avoid hw_miss_packet_recover() for devices with no support

Eli Britstein elibr at nvidia.com
Mon Nov 22 13:37:35 UTC 2021


On 11/22/2021 3:19 PM, Sriharsha Basavapatna wrote:
> Hi Eli,
>
> On Sun, Nov 21, 2021 at 12:03 PM Eli Britstein via dev
> <ovs-dev at openvswitch.org> wrote:
>> Hi Harsha,
>>
>> It's a clever idea, though have some problems in the implementation. PSB.
> Thanks, please see my response below.
>>
>> On 11/20/2021 11:20 AM, Sriharsha Basavapatna wrote:
>>> The hw_miss_packet_recover() API results in performance degradation, for
>>> ports that are either not offload capable or do not support this specific
>>> offload API.
>>>
>>> For example, in the test configuration shown below, the vhost-user port
>>> does not support offloads and the VF port doesn't support hw_miss offload
>>> API. But because tunnel offload needs to be configured in other bridges
>>> (br-vxlan and br-phy), OVS has been built with -DALLOW_EXPERIMENTAL_API.
>>>
>>>       br-vhost            br-vxlan            br-phy
>>> vhost-user<-->VF    VF-Rep<-->VxLAN       uplink-port
>>>
>>> For every packet between the VF and the vhost-user ports, hw_miss API is
>>> called even though it is not supported by the ports involved. This leads
>>> to significant performance drop (~3x in some cases; both cycles and pps).
>>>
>>> To fix this, return EOPNOTSUPP when this API fails for a device that
>> "To fix" -> "To improve"
>>> doesn't support it and avoid this API on that port for subsequent packets.
>>>
>>> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna at broadcom.com>
>>> ---
>>>    lib/dpif-netdev-private.h |  2 +-
>>>    lib/dpif-netdev.c         | 29 +++++++++++++++++++++--------
>>>    lib/netdev-offload-dpdk.c |  9 +++++++--
>>>    3 files changed, 29 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
>>> index 4593649bd..e2a6a9d3a 100644
>>> --- a/lib/dpif-netdev-private.h
>>> +++ b/lib/dpif-netdev-private.h
>>> @@ -46,7 +46,7 @@ dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd,
>>>
>>>    int
>>>    dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>> -                  odp_port_t port_no,
>>> +                  void *port,
>> void * -> struct tx_port *. use a forward declaration.
>>
>>>                      struct dp_packet *packet,
>>>                      struct dp_netdev_flow **flow);
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 69d7ec26e..207b1961c 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -434,6 +434,7 @@ struct tx_port {
>>>        long long last_used;
>>>        struct hmap_node node;
>>>        long long flush_time;
>>> +    bool hw_miss_api_supported;
>>>        struct dp_packet_batch output_pkts;
>>>        struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>>>    };
>>> @@ -6972,6 +6973,7 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
>>>        tx->port = port;
>>>        tx->qid = -1;
>>>        tx->flush_time = 0LL;
>>> +    tx->hw_miss_api_supported = true;
>>>        dp_packet_batch_init(&tx->output_pkts);
>>>
>>>        hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
>>> @@ -7327,22 +7329,28 @@ static struct tx_port * pmd_send_port_cache_lookup(
>>>
>>>    inline int
>>>    dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>> -                  odp_port_t port_no OVS_UNUSED,
>>> +                  void *port,
>> don't omit OVS_UNUSED. it is for compiling without ALLOW_EXPERIMENTAL_API
> Ok.
>>>                      struct dp_packet *packet,
>>>                      struct dp_netdev_flow **flow)
>>>    {
>>> -    struct tx_port *p OVS_UNUSED;
>>> +    struct tx_port *p = port;
>> no need for this local variable, you get it from the function arguments
> The declaration of dp_netdev_hw_flow() in dpif-netdev-private.h can't
> see 'struct tx_port' since it is defined in dpif_netdev.c. So it needs
> to be a void * argument.

In the H file, use forward declaration, like this:

struct tx_port;

void foo(struct tx_port *port);

Then, in the C file this stack variable can be removed.

>>>        uint32_t mark;
>>>
>>>    #ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
>>>        /* Restore the packet if HW processing was terminated before completion. */
>>> -    p = pmd_send_port_cache_lookup(pmd, port_no);
>>> -    if (OVS_LIKELY(p)) {
>>> +    if (OVS_LIKELY(p) && p->hw_miss_api_supported) {
>>>            int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
>>>
>>> -        if (err && err != EOPNOTSUPP) {
>>> -            COVERAGE_INC(datapath_drop_hw_miss_recover);
>>> -            return -1;
>>> +        if (err) {
>>> +            if (err != EOPNOTSUPP) {
>>> +                COVERAGE_INC(datapath_drop_hw_miss_recover);
>>> +                return -1;
>>> +            } else {
>>> +                /* API unsupported by the port; avoid subsequent calls. */
>>> +                VLOG_DBG("hw_miss_api unsupported: port: %d",
>>> +                         p->port->port_no);
>>> +                p->hw_miss_api_supported = false;
>>> +            }
>>>            }
>>>        }
>>>    #endif
>>> @@ -7394,6 +7402,11 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>>        uint16_t tcp_flags;
>>>        size_t map_cnt = 0;
>>>        bool batch_enable = true;
>>> +    struct tx_port *port = NULL;
>>> +
>>> +#ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
>>> +    port = pmd_send_port_cache_lookup(pmd, port_no);
>>> +#endif
>>>
>>>        pmd_perf_update_counter(&pmd->perf_stats,
>>>                                md_is_valid ? PMD_STAT_RECIRC : PMD_STAT_RECV,
>>> @@ -7420,7 +7433,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>>            }
>>>
>>>            if (netdev_flow_api && recirc_depth == 0) {
>>> -            if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, port_no, packet, &flow))) {
>>> +            if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, port, packet, &flow))) {
>>>                    /* Packet restoration failed and it was dropped, do not
>>>                     * continue processing.
>>>                     */
>>> diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
>>> index 9fee7570a..8ddddbd2e 100644
>>> --- a/lib/netdev-offload-dpdk.c
>>> +++ b/lib/netdev-offload-dpdk.c
>>> @@ -2292,11 +2292,16 @@ netdev_offload_dpdk_hw_miss_packet_recover(struct netdev *netdev,
>>>        odp_port_t vport_odp;
>>>        int ret = 0;
>>>
>>> -    if (netdev_dpdk_rte_flow_get_restore_info(netdev, packet,
>>> -                                              &rte_restore_info, NULL)) {
>>> +    ret = netdev_dpdk_rte_flow_get_restore_info(netdev, packet,
>>> +                                                &rte_restore_info, NULL);
>>> +    if (ret) {
>>>            /* This function is called for every packet, and in most cases there
>>>             * will be no restore info from the HW, thus error is expected.
>>>             */
>>> +        VLOG_DBG_RL(&rl, "flow_get_restore_info failed: %d\n", ret);
>> It is likely that most of the packets will have an "error" when calling
>> this API, if there is nothing to get. See the comment. There was no
>> print on purpose.
> I added it for debugging, will remove.
>>> +        if (ret == -EOPNOTSUPP) {
>> How can we guarantee EOPNOTSUPP is only if indeed not supported and not
>> that the PMD returned it (for this packet only)?
> PMDs must return some other errors for per-packet issues (e.g EINVAL
> for invalid packet format etc). EOPNOTSUPP indicates lack of API
> support. It is already being used to indicate that, as seen in
> netdev-offload.c::hw_miss_packet_recover(). Similarly, a PMD that
> doesn't support this API should return EOPNOTSUPP.

OK. maybe it's better than to push a comment in DPDK to clarify this. 
Currently it just says:

  *   0 on success, a negative errno value otherwise and rte_errno is set.

>
>      return (flow_api && flow_api->hw_miss_packet_recover)
>              ? flow_api->hw_miss_packet_recover(netdev, packet)
>              : EOPNOTSUPP;
>> Maybe it is better to have another API to query (might also need changes
>> in dpdk).
> I don't see the need here, as explained above.
>>> +            return -ret;
>>> +        }
>>>            return 0;
>>>        }
>>>
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.openvswitch.org%2Fmailman%2Flistinfo%2Fovs-dev&data=04%7C01%7Celibr%40nvidia.com%7C09488c9b40e34562033e08d9adbabe4b%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637731839842570407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=qdJho5H0UgYU6LlHfPjyv5y6tMCk4CndfsdmmlCgIzw%3D&reserved=0


More information about the dev mailing list