[ovs-dev] [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.

Ilya Maximets i.maximets at ovn.org
Fri Jun 25 15:25:26 UTC 2021


On 6/25/21 3:56 PM, Eli Britstein wrote:
> 
> On 6/25/2021 3:09 PM, Balazs Nemeth wrote:
>>
>> Ilya,
>>
>> Sure, I'll rebase my patch and also ensure netdev_is_flow_api_enabled()
>> is called once per batch. I don't expect any issue with that.

Thanks, Balazs!  The series is applied now, so you can work on an updated patch.
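
For reference, a rough sketch of what that per-batch hoist could look like
in dfc_processing() (purely illustrative, not the actual rebase; the
'hw_miss_enabled' local is made up):

    /* Evaluate the flow API check once per batch instead of once per
     * packet.  netdev_is_flow_api_enabled() and the recirculation depth
     * cannot change in the middle of a batch, so hoisting them out of
     * the per-packet loop is safe. */
    bool hw_miss_enabled = netdev_is_flow_api_enabled()
                           && *recirc_depth_get() == 0;

    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
        struct dp_netdev_flow *flow;
        ...
        if (hw_miss_enabled) {
            if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, port_no, packet,
                                               &flow))) {
                /* The packet was dropped during recovery; skip it. */
                continue;
            }
        }
        ...
    }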

> 
> Thanks Balazs.
> 
>>
>> Regards,
>> Balazs
>>
>> On Fri, Jun 25, 2021 at 2:00 PM Ilya Maximets <i.maximets at ovn.org> wrote:
>>
>>     On 6/25/21 1:04 PM, Ferriter, Cian wrote:
>>     > Hi Eli,
>>     >
>>     > I have some concerns about how we plug the packet state recovery logic into dfc_processing() here. My concerns are inline below.
>>     >
>>     > I'm concerned that we are hurting performance in the partial HWOL case, as this patchset introduces new direct (non-inline) per-packet function calls into the software datapath. We can reduce the performance impact in this area; see specific suggestions below.
>>     >
>>     > Thanks,
>>     > Cian
>>     >
>>     >> -----Original Message-----
>>     >> From: Eli Britstein <elibr at nvidia.com>
>>     >> Sent: Wednesday 23 June 2021 16:53
>>     >> To: dev at openvswitch.org; Ilya Maximets <i.maximets at ovn.org>
>>     >> Cc: Gaetan Rivet <gaetanr at nvidia.com>; Majd Dibbiny <majd at nvidia.com>; Sriharsha Basavapatna
>>     >> <sriharsha.basavapatna at broadcom.com>; Hemal Shah <hemal.shah at broadcom.com>; Ivan Malov
>>     >> <Ivan.Malov at oktetlabs.ru>; Ferriter, Cian <cian.ferriter at intel.com>; Eli Britstein <elibr at nvidia.com>;
>>     >> Finn, Emma <emma.finn at intel.com>; Kovacevic, Marko <marko.kovacevic at intel.com>
>>     >> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
>>     >>
>>     >> Recover the packet if it was partially processed by the HW. Fall back to
>>     >> flow lookup by mark association.
>>     >>
>>     >> Signed-off-by: Eli Britstein <elibr at nvidia.com>
>>     >> Reviewed-by: Gaetan Rivet <gaetanr at nvidia.com>
>>     >> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna at broadcom.com>
>>     >> Tested-by: Emma Finn <emma.finn at intel.com>
>>     >> Tested-by: Marko Kovacevic <marko.kovacevic at intel.com>
>>     >> Signed-off-by: Ilya Maximets <i.maximets at ovn.org>
>>     >> ---
>>     >>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
>>     >>  1 file changed, 41 insertions(+), 4 deletions(-)
>>     >>
>>     >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>     >> index 8fa7eb6d4..d5b7d5b73 100644
>>     >> --- a/lib/dpif-netdev.c
>>     >> +++ b/lib/dpif-netdev.c
>>     >> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
>>     >>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
>>     >>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
>>     >>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
>>     >> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
>>     >>
>>     >>  /* Protects against changes to 'dp_netdevs'. */
>>     >>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
>>     >> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>>     >>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
>>     >>  }
>>     >>
>>     >> +static struct tx_port * pmd_send_port_cache_lookup(
>>     >> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>     >> +
>>     >> +static inline int
>>     >> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>     >> +                  odp_port_t port_no,
>>     >> +                  struct dp_packet *packet,
>>     >> +                  struct dp_netdev_flow **flow)
>>     >> +{
>>     >> +    struct tx_port *p;
>>     >> +    uint32_t mark;
>>     >> +
>>     >
>>     >
>>     > Start of full HWOL recovery block
>>     >
>>     >> +    /* Restore the packet if HW processing was terminated before completion. */
>>     >> +    p = pmd_send_port_cache_lookup(pmd, port_no);
>>     >> +    if (OVS_LIKELY(p)) {
>>     >> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
>>     >> +
>>     >> +        if (err && err != EOPNOTSUPP) {
>>     >> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
>>     >> +            return -1;
>>     >> +        }
>>     >> +    }
>>     >
>>     > End of full HWOL recovery block
>>     >
>>     > IIUC, the above is only relevant to full HWOL, where the packet is partially processed by the HW. In a partial HWOL test case, we see a performance drop as a result of the above code: performance after the patchset is applied is 0.94x what it was before.
>>
> Generally speaking, adding new code to the datapath is likely to have some degradation effect that cannot be avoided completely.
> 
> I think performance optimizations for partial offloads (or for the SW datapath in general, even w/o offloads) can be done on top, like the one suggested by Balazs above.

+1

Anyway, I applied the v7 as-is.  Performance optimizations could be done on
top of it.

> 
> 
>>     While reviewing the patch set I noticed this part too, but
>>     this code was tested twice by Intel engineers, so I figured
>>     that it doesn't hurt the performance of partial offloading.
>>
>>     In general, it should be easy to re-order partial and full
>>     offloading checks like this (didn't test):
>>
>>     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>     index c5ab35d2a..36a5976f2 100644
>>     --- a/lib/dpif-netdev.c
>>     +++ b/lib/dpif-netdev.c
>>     @@ -7113,6 +7113,14 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>          struct tx_port *p;
>>          uint32_t mark;
>>
>>     +    /* If no mark, no flow to find. */
>>     +    if (dp_packet_has_flow_mark(packet, &mark)) {
>>     +        *flow = mark_to_flow_find(pmd, mark);
>>     +        return 0;
>>     +    }
>>     +
>>     +    *flow = NULL;
>>     +
>>          /* Restore the packet if HW processing was terminated before completion. */
>>          p = pmd_send_port_cache_lookup(pmd, port_no);
>>          if (OVS_LIKELY(p)) {
>>     @@ -7123,14 +7131,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>                  return -1;
>>              }
>>          }
>>     -
>>     -    /* If no mark, no flow to find. */
>>     -    if (!dp_packet_has_flow_mark(packet, &mark)) {
>>     -        *flow = NULL;
>>     -        return 0;
>>     -    }
>>     -
>>     -    *flow = mark_to_flow_find(pmd, mark);
>>          return 0;
>>      }
>>     ---
>>
>>     If everyone agrees, I can do this change.
>>
> No, this is wrong.
> 
> The mlx5 PMD uses a mark (set internally) for the recovery. Doing it like this will discover a mark (the internal one set by the PMD), but won't find any flow associated with it.

Oh, and netdev_offload_dpdk_hw_miss_packet_recover() resets offloads for
that reason, right?

Thanks for the explanation.  I think it would be great if you could prepare
a separate patch with some comments in netdev_offload_dpdk_hw_miss_packet_recover()
regarding the offloads reset and, maybe, a small comment in dp_netdev_hw_flow()
so we will not re-order the checks in the future (it's not obvious that there is a
dependency).  Something like:

"Flow marks can be used for HW miss recovery and could be re-set in the
process, therefore flow mark check should always be done after the
netdev_hw_miss_packet_recover()."

What do you think?
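
For illustration, the dp_netdev_hw_flow() part could look roughly like
this (just a sketch of the comment placement, reusing the code from the
patch above; the exact wording would be up to the follow-up patch):

    /* Restore the packet if HW processing was terminated before
     * completion.
     *
     * Note: flow marks can be used by the PMD for HW miss recovery and
     * may be re-set in the process, so the dp_packet_has_flow_mark()
     * check below must stay after netdev_hw_miss_packet_recover();
     * do not re-order them. */
    p = pmd_send_port_cache_lookup(pmd, port_no);
    if (OVS_LIKELY(p)) {
        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);

        if (err && err != EOPNOTSUPP) {
            COVERAGE_INC(datapath_drop_hw_miss_recover);
            return -1;
        }
    }

    /* If no mark, no flow to find. */
    if (!dp_packet_has_flow_mark(packet, &mark)) {
        ...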

> 
>>
>>     >
>>     > We should branch over this code in the partial HWOL scenario, where we don't need the calls to pmd_send_port_cache_lookup() and netdev_hw_miss_packet_recover(). We want this branch to be transparent to the user. Since both full and partial HWOL fall under the other_config:hw-offload=true switch, we might need a configure-time check of NIC capabilities or something similar to branch over the full HWOL packet recovery code. Does this make sense?
>>
>>     A compile-time check of capabilities doesn't make sense, as in the vast
>>     majority of cases it's unknown which HW the package will run on.  The
>>     code should be as generic as possible.
>>
>>     >
>>     >
>>     > "perf top" output showing cycles spent per function in my partial HWOL scenario. We can see netdev_hw_miss_packet_recover(), netdev_offload_dpdk_hw_miss_packet_recover() and netdev_is_flow_api_enabled() taking cycles:
>>     >     28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
>>     >     13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
>>     >     11.07%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_recv_pkts_vec_avx2
>>     >     10.94%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_xmit_fixed_burst_vec_avx2
>>     >      7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
>>     >      4.94%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_hw_miss_packet_recover
>>     >      4.77%  pmd-c01/id:8  ovs-vswitchd        [.] dp_execute_output_action
>>     >      3.81%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
>>     >      3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
>>     >      2.49%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_eth_send
>>     >      1.16%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
>>     >      0.90%  pmd-c01/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>>     >      0.75%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>>     >      0.68%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_is_flow_api_enabled
>>     >      0.55%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_offload_dpdk_hw_miss_packet_recover
>>     >
>>     >> +
>>     >> +    /* If no mark, no flow to find. */
>>     >> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
>>     >> +        *flow = NULL;
>>     >> +        return 0;
>>     >> +    }
>>     >> +
>>     >> +    *flow = mark_to_flow_find(pmd, mark);
>>     >> +    return 0;
>>     >> +}
>>     >> +
>>     >>  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
>>     >>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>>     >>   * miniflow is copied into 'keys' and the packet pointer is moved at the
>>     >> @@ -7106,7 +7140,6 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>     >>
>>     >>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>>     >>          struct dp_netdev_flow *flow;
>>     >> -        uint32_t mark;
>>     >>
>>     >>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>>     >>              dp_packet_delete(packet);
>>     >> @@ -7125,9 +7158,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>     >>              pkt_metadata_init(&packet->md, port_no);
>>     >>          }
>>     >>
>>     >> -        if ((*recirc_depth_get() == 0) &&
>>     >> -            dp_packet_has_flow_mark(packet, &mark)) {
>>     >> -            flow = mark_to_flow_find(pmd, mark);
>>     >> +        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {
>>     >
>>     > Here we have a per-packet call to netdev_is_flow_api_enabled(). I think netdev_is_flow_api_enabled() should be inlined if it's going to be called per packet. We can see from the "perf top" output above that it isn't inlined, since it shows up as a separate function.
>>
>>     I'd consider "inlining" and moving a lot of stuff to headers harmful
>>     for the code base, as it makes the code less readable and it's really
>>     hard to preserve this kind of thing during code modification.
>>     It's much better to fix the logic instead of hammering the code with
>>     blind low-level optimizations.
>>
>>     For this particular issue we already have a solution here:
>>       https://patchwork.ozlabs.org/project/openvswitch/patch/c0b99b4b9be6c290a6aa3cb00049fac4ebfac5d8.1618390390.git.bnemeth@redhat.com/
>>     In short, we only need to check once per batch.  I think Balazs
>>     will be able to rebase his patch on top of this series, including
>>     the check for netdev_is_flow_api_enabled().
>>
>>     Balazs, will you?
>>
>>     Best regards, Ilya Maximets.
>>


