[ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action
Finn Christensen
fc at napatech.com
Fri Sep 15 19:55:27 UTC 2017
-----Original Message-----
From: Darrell Ball [mailto:dball at vmware.com]
Sent: 15. september 2017 20:09
To: Finn Christensen <fc at napatech.com>
Cc: dev at openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action
On 9/15/17, 2:16 AM, "Finn Christensen" <fc at napatech.com> wrote:
-----Original Message-----
From: Darrell Ball [mailto:dball at vmware.com]
Sent: 14. september 2017 21:35
To: Finn Christensen <fc at napatech.com>
Cc: dev at openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action
On 9/14/17, 1:14 AM, "Finn Christensen" <fc at napatech.com> wrote:
-----Original Message-----
From: ovs-dev-bounces at openvswitch.org [mailto:ovs-dev-bounces at openvswitch.org] On Behalf Of Darrell Ball
Sent: 13. september 2017 18:18
To: Simon Horman <simon.horman at netronome.com>
Cc: dev at openvswitch.org
Subject: Re: [ovs-dev] [PATCH v2 6/8] netdev-dpdk: retry with queue action
On 9/13/17, 2:57 AM, "Simon Horman" <simon.horman at netronome.com> wrote:
On Tue, Sep 12, 2017 at 08:36:19AM +0000, Darrell Ball wrote:
>
> On 9/10/17, 11:14 PM, "ovs-dev-bounces at openvswitch.org on behalf of Yuanhan Liu" <ovs-dev-bounces at openvswitch.org on behalf of yliu at fridaylinux.org> wrote:
>
> On Fri, Sep 08, 2017 at 06:48:50PM +0200, Simon Horman wrote:
> > On Tue, Sep 05, 2017 at 05:22:59PM +0800, Yuanhan Liu wrote:
> > > From: Finn Christensen <fc at napatech.com>
> > >
> > > AFAIK, most (if not all) NICs (including Mellanox and Intel) do not
> > > support a pure MARK action. It's required to be used together with
> > > some other action, like QUEUE.
> > >
> > > To work around this, retry with a queue action when the first try
> > > fails.
> > >
> > > Moreover, some Intel NICs (say, the XL710) need the QUEUE action
> > > set before the MARK action.
> > >
> > > Co-authored-by: Yuanhan Liu <yliu at fridaylinux.org>
> > > Signed-off-by: Finn Christensen <fc at napatech.com>
> > > Signed-off-by: Yuanhan Liu <yliu at fridaylinux.org>
> >
> > This feels a bit like the tail wagging the dog.
> > Is this the lowest level at which it makes sense to implement this
> > logic?
> >
> > If so then I wonder if some sort of probing would be in order to
> > avoid the cost of trying to add the flow twice to hardware where
> > the queue is required.
>
> Do you mean something like an rte_flow capability query, e.g. whether
> a queue action is needed for a mark action? If so, yes, I do think we
> are missing an interface like this.
>
> Note that even in this solution, the flow won't be created twice in
> the hardware, because the first try would fail.
> [Darrell]
>
> Having an API to query the capability and avoid the first try to HW
> would be nice, but there are dependencies on RTE, drivers etc. and I
> don't know definitively what the API would be.
>
> Also, as NICs are added, this capability needs to be supported and
> state needs to be kept in all cases.
>
> It is an enhancement and, if done, should be reliable.
Agreed. Though I was more thinking of probing the hardware rather than
having a capability API - I expect this would remove several of the
dependencies you describe above.
[Darrell] I have been pondering the probing option as well. It is
certainly a valid option; we use it in other cases such as datapath
probing. One of the aspects that worries me here is maintaining the
correct per-interface (essentially; although the attribute is per NIC)
state across various events such as new ports being added, vswitchd
restarts, and races with flow creation. It would be non-trivial, I
guess, and probably appropriate for the next patch series, if done.
In this case, we have what seems like a clear distinction b/w
Napatech, which does not need the queue action workaround, and
everything else, which does.
Besides the non-Napatech behavior, which is worrisome, maintaining the
difference for flow handling under the covers is concerning.
I wonder if we should be as upfront as possible here and just have a
dpdk interface configuration - maybe something like "supports native
HWOL mark action" - since the better behavior is the exception?
The interface config would be more robust than probing.
This would need documentation, of course.
[Finn]
I think the rte queue action should never be used here when using
partial HWOL - not the way OVS handles multiple queues today.
Maybe a "default queue" could be used in the dpdk PMDs when no queue
is specified in the rte flow?
[Darrell] This is the Napatech case, where no queue action is needed;
you are suggesting programming a default queue in this case.
I don't follow how this would be helpful/desired?
[Finn]
I was trying to give my view on this, not particularly arguing for the
Napatech case.
Here is what I was thinking:
In the partial HWOL case, we are trying to offload the flow
classification to HW, as in "pre-classify and mark". This mark is then
used to accelerate OVS while finding the actions to execute. Since we
leave all processing of the actions to OVS, there is no way for the
partial HWOL to know, at rte flow creation time, where to send the
pre-classified packets (which is strictly needed when seen from a DPDK
rte flow point of view).
When multiple queues are specified in OVS, a hash-splitting mechanism
is used in the NIC to support RSS. The NIC is then responsible for
selecting the right queue according to the configured algorithm for
RSS.
OVS only needs to know how many queues to service per port. No
knowledge about the association b/w flow <-> rxq is used in OVS today.
When looking at full HWOL, all flow actions will have to be
interpreted, supported and programmed into the NIC using rte flow -
sent directly to the target port - and we do not have this issue.
Have I missed something?
[Darrell]
I understand your points about the full offload you are working on.
I just did not follow this one comment:
"Maybe a "default queue" could be used in the dpdk PMDs when no queue
is specified in rte flow?"
which I thought was related to the partial offloads used here and the
receive queue action we are discussing? I am concerned that I missed
your point about this specific comment.
[Finn]
Sorry for my messy descriptions - yes, the "default queue" idea was
for the partial HWOL case. I just think that if the queue could be set
to 0 when 1 rxq is used, and the hash algorithm could select the queue
when RSS is used, then we would not have to add QUEUE to the rte flow
create.
However, it was just an idea and I don't know if it can be done
reliably in all DPDK PMDs.
Essentially, this is a mismatch between the rte flow functionality
implemented in the PMD and the needs of OVS for flow classification in
a partial HWOL setup.
[Darrell] Maybe this has been explored, but were additional workaround
actions, like RTE_FLOW_ACTION_TYPE_RSS, considered?
I guess we could make this essentially a NOP, but if the mark action
could ride along with it, then that would be good.
[Finn]
No. We have not explored that.
I'm not sure I understand how you would use RTE_FLOW_ACTION_TYPE_RSS
in this situation.
[Darrell]
So there are two points here:
1/ 'IF' RTE_FLOW_ACTION_TYPE_RSS can be used in lieu of
RTE_FLOW_ACTION_TYPE_QUEUE combined with RTE_FLOW_ACTION_TYPE_MARK, as
needed for the workaround for most NICs, then this would be better for
distributing packets across PMDs for the masked flows relevant here.
This would help mitigate one of the concerns. We could use the RSS we
use globally, essentially making the distribution a NOP compared with
what we normally do today with RSS.
2/ Assuming '1', we might even use this in all cases as a first try
for all NICs, which, if it could be done, solves the queue action
workaround capability/probing thing?
[Finn] Thanks, I see what you mean. It makes very good sense if it is
supported in the NIC DPDK PMDs.
However, we would not mind Darrell's proposal; it makes sense since
most NIC PMDs unfortunately need this and it is only relevant for
partial HWOL. We are mostly concerned with full HWOL and therefore see
partial HWOL as a failover for when full HWOL could not be handled in
the NIC, if enabled.
I think we anyway need documentation describing the difference b/w
NICs in the dpdk documentation (the howto part).
Assuming no such enhancement is appropriate at this time, I would
still like to ask if this is the best place for this hardware-specific
code?
[Darrell]
For OVS, the netdev-dpdk layer is the lowest layer.
This kind of workaround is hard to hide, since we are messing with the
rxq, so I think OVS needs to know that it is in effect anyway. An
alternative is to supply a mark and an 'optional queue' and let the
driver decide if the queue is needed and report back whether it was.
This would be hard to do across various drivers. Supporting it in the
rte layer would require both rte and driver support, so even more
support.
> A separate comment is that we need to document which NICs need the
> queue action.
>
> Also, I think we should check errno in the present code.
_______________________________________________
dev mailing list
dev at openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Disclaimer: This email and any files transmitted with it may contain
confidential information intended for the addressee(s) only. The
information is not to be surrendered or copied to unauthorized
persons. If you have received this communication in error, please
notify the sender immediately and delete this e-mail from your system.