[ovs-dev] Cloning packets for "action: set field"

Saurabh Shah ssaurabh at vmware.com
Tue Aug 26 19:29:29 UTC 2014


Hi Samuel,

> -----Original Message-----
> From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
> Sent: Tuesday, August 26, 2014 10:40 AM
> To: Saurabh Shah
> Cc: dev at openvswitch.org
> Subject: RE: Cloning packets for "action: set field"
> 
> Hi Saurabh,
> 
> That may not be a very easy answer, but, a better way would be, I believe:
> headRoom -> encapBytes
> copySize -> headersSize
> copySize + headRoom -> (new variable) newMdlSize
> 

Let me think a little more about this. The functionality is meant to be generic, so I don't like calling the variable 'encapBytes', since it becomes a specific usage of the API.

> It would also be very useful to have documentation comments on the
> parameters, e.g.
> /*
> * headersSize:    the number of bytes from the beginning of the packet
> buffer to be copied to a new MDL.
> *                       The original NBL will be advanced by headersSize, after which
> the NBL will be cloned, causing
> *                       the cloned NBL's NB to have DataOffset = 0. This will make a
> Retreat operation on the cloned NBL
> *                       to allocate a new MDL. This allows us to modify the packet
> buffer headers on the clone later on,
> *                       while preserving the original NBL unmodified.
> * encapBytes:     additional bytes, which will be used for prepending the
> encapsulation headers (e.g. VXLAN + ipv4 + eth)
> */
> 
> /*
> * newMdlSize:    the size of the new MDL that will be allocated for the cloned
> NBL. It takes into account the size of the
> *                       packet headers (if some header field needs to be modified), and
> the additional bytes required for encapsulation.
> */
> 

I definitely plan to document each parameter, return value and any other assumption that the API may have.

> This is a suggestion.
> As for the documentary comments suggestion, please feel free to alter it in
> any way you think it would make code clearer.
> 

Thanks for the suggestions, it helps to get a different prespective. I will send a patch soon.

-- Saurabh

> Thanks,
> Sam
> 
> ________________________________________
> From: Saurabh Shah [ssaurabh at vmware.com]
> Sent: Tuesday, August 26, 2014 1:30 AM
> To: Samuel Ghinet
> Cc: dev at openvswitch.org
> Subject: Re: Cloning packets for "action: set field"
> 
> Hi Samuel,
> 
> >Hello Saurabh,
> >
> >Thanks for the clarifications!
> >The param name "copySize" was very ambiguous.
> 
> What name do you think will make it easy to understand? I can take this in to
> account when I post a patch for -
> https://urldefense.proofpoint.com/v1/url?u=https://github.com/openvswit
> ch/ovs-
> issues/issues/28&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=pEkjsHfy
> tvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=v6onB5k72z1unZ
> NkjO1VAZRXD%2Fkz98AovnuDeYkye6A%3D%0A&s=9a95e049cbdfedbaabc91
> 364ac73e0d558e18c32e79bfcd2f1cb1eb8aaf71894.
> 
> >
> >I did not debug this exact scenario, but you must be right:
> >- Cloning an original NB that has DataOffset > 0 causes the clone to
> >have DataOffset = 0
> >- NdisRetreatNetBufferDataStart doc says "
> >DataOffsetDelta [in]
> >
> >    The amount of used data space to add. NDIS adjusts the DataOffset
> >member of the NET_BUFFER structure accordingly. If there is not enough
> >unused data space to satisfy the request, NDIS allocates additional
> >memory.
> >"
> >Therefore, a retreat on the clone must allocate a new MDL. The
> >NdisCopyFromNetBufferToNetBuffer, then, makes sense.
> >
> >And how do you do when you need to, say, set dest ipv4 addr, then set
> >src
> >ipv4 addr, then out to ports, for the same packet? Do you make two
> >clones, one for each set action?
> 
> Yes, we would clone from the first clone.
> 
> >
> >And if you need to:
> >clone to clone1, set ipv4 dest addr1, out to port 1; then clone to
> >clone2, set ipv4 dest addr2, out to port 2;
> >clone2 will be a clone of clone1?
> 
> Yes.
> 
> >
> >Thanks!
> >Sam
> >________________________________________
> >From: Saurabh Shah [ssaurabh at vmware.com]
> >Sent: Friday, August 15, 2014 8:03 AM
> >To: Samuel Ghinet
> >Cc: dev at openvswitch.org
> >Subject: Re: Cloning packets for "action: set field"
> >
> >Hi Samuel,
> >
> >Thanks for your patience. I believe all of your concerns are arising
> >due to a fundamental misunderstanding of the NDIS apis & the buffer
> >management code. I thought it would useful if I walk you through the
> >OvsPartialCopyNBL logic by giving an example.
> >
> >Let's say there is a call to OvsPartialCopyNBL(), with copySize = 50.
> >This is presumably the size of the header that you want to be able to
> >modify it.
> >
> >Say, the original NBL looks like following:
> >
> >Original NBL
> >------------
> >        |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
> >             |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
> >MappedSystemVa = 0x1000, Next = NULL]
> >
> >
> >So, here is how the partial copy logic works -
> >
> >
> >01. Inside partial copy API, first we advance the original NBL by
> >copySize
> >(50 bytes).
> >
> >After this operation the original NBL looks like following:
> >
> >Original NBL
> >------------
> >        |-> NB1 [DataLength = 1450, CurrentMdlOffset = 50, Next = NULL]
> >             |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
> >MappedSystemVa = 0x1000, Next = NULL]
> >
> >Note: CurrentMdlOffset & DataLength get modified.
> >
> >02. Allocate a clone from the Original NBL.
> >
> >The original NBL is unchanged & the cloned NBL looks like following:
> >
> >
> >
> >Clone NBL
> >---------
> >   |-> NB1 [DataLength = 1450, CurrentMdlOffset = 0, Next = NULL]
> >        |-> MdlChain --> MDL2 [ByteCount = 1450, ByteOffset = 50,
> >MappedSystemVa = 0x1000, Next = NULL]
> >
> >Note: The cloned NBL has a new MDL shell (i.e MDL2), but is pointing to
> >the same data buffer VA:0x1000. The ByteOffset is appropriately
> >adjusted in MDL2 (by 50 bytes).
> >
> >03. The original NBL gets retreated by copySize.
> >
> >With this operation, the original NBL becomes *exactly* like how the
> >caller had given it.
> >
> >Original NBL
> >------------
> >        |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
> >             |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
> >MappedSystemVa = 0x1000, Next = NULL]
> >
> >
> >04. The cloned NBL is retreated by copySize bytes (assume headRoom = 0).
> >
> >The clone NBL will look like following:
> >
> >Clone NBL
> >---------
> >   |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
> >        |-> MdlChain --> MDL3 [ByteCount = 50, ByteOffset = 0,
> >MappedSystemVa = 0x2000, Next = MDL2]
> >                           |--> MDL2 [ByteCount = 1450, ByteOffset =
> >50, MappedSystemVa = 0x1000, Next = NULL]
> >
> >
> >
> >Note: A new memory descriptor (MDL3) is added in front of the chain and
> >the new descriptor points to a completely new data buffer, VA:0x2000.
> >
> >05. We copy ŒcopySize¹ bytes from original NBL to the head of the
> >cloned NBL. For this we use the ndis call,
> >NdisCopyFromNetBufferToNetBuffer. NDIS api takes care that 'copySize¹
> >bytes get copied correctly. This means that even if the bytes where
> >split in multiple MDLs they will correctly be copied to a contiguous new data
> buffer of the cloned NBL.
> >
> >Note: At the end of this step, the cloned NBL is guaranteed to have
> >ŒcopySize¹ amount of data in a contiguous data buffer. And the data
> >buffer is solely owned by the cloned NBL.
> >
> >06. Allocate NBL context for the cloned NBL, set appropriate flags in
> >the NBL context & take a ref on the parent NBL context.
> >
> >To summarize, when partial copy returns, the caller gets a new NBL with
> >copySize amount of data in a contiguous *writeable* buffer.
> >
> >
> >Completion Path:
> >----------------
> >
> >For most operations that require you to modify the packet, we would
> >have done a partial copy at some point. After manipulating the
> >partially copied NBL, the packet gets sent to the appropriate port.
> >Completion is triggered when NDIS calls our
> >'FilterSendNetBufferListsComplete' callback. The complete callback will
> >go ahead and complete the NBL it receives from NDIS. This will trigger
> >a chain of NBL free's, till it reaches the root NBL. The root NBL is
> >the original NBL that the extension received in the
> >ŒFilterSendNetBufferLists¹ callback. When the refcount of the root NBL
> >reaches 0, we propagate the completion by calling
> >NdisFSendNetBufferListsComplete for all those root NBLs whose refcount
> have become 0.
> >
> >
> >In theory this is how things are supposed to work. If you still think
> >that this approach has holes, then let me know and we can do a separate
> >IRC session just to discuss this. :)
> >
> >Thanks,
> >Saurabh
> >
> >
> >From:  Samuel Ghinet <sghinet at cloudbasesolutions.com>
> >Date:  Thursday, August 7, 2014 at 10:45 AM
> >To:  Saurabh Shah <ssaurabh at vmware.com>
> >Cc:  "dev at openvswitch.org" <dev at openvswitch.org>
> >Subject:  RE: Cloning packets for "action: set field"
> >
> >
> >>"it sometimes happens for packets to come in FilterSentNetBufferLists
> >>callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished
> >>being sent via NdisFSendNetBufferLists,
> >>FilterSentNetBufferListsComplete starts to be called."
> >>
> >>I mean, to enter the FilterSentNetBufferLists callback 4 times, each
> >>time for a NBL. And only after the 4th was done, the
> >>FilterSentNetBufferListsComplete callback to start being called.
> >>
> >>This may mean that it is possible that:
> >>1) cloning NBL1 into NBL2
> >>2) call NdisFSendNetBufferLists on NBL1
> >>3) modify NBL2
> >>may still corrupt the buffer of NBL1.
> >>
> >>Reading from MSDN:
> >>"As soon as a filter driver calls the NdisFSendNetBufferLists
> >>function, it relinquishes ownership of the NET_BUFFER_LIST structures
> >>and all associated resources. "
> >>
> >>"Until NDIS calls FilterSendNetBufferListsComplete, the current status
> >>of the send request is not available to the filter driver. A filter
> >>driver temporarily releases ownership of all resources that are
> >>associated with a send request when it calls NdisFSendNetBufferLists.
> >>A filter driver should never try to examine the NET_BUFFER_LIST
> >>structures or any associated data after calling NdisFSendNetBufferLists."
> >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com
> /
> >>en-
> >>u
> >>s/library/windows/hardware/ff562616%28v%3Dvs.85%29.aspx&k=oIvRg1
> %2BdGA
> >>gOo
> >>M
> >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNh
> CUc0E%3D%0A&
> >>m=O
> >>0
> >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=891f03
> 24d2d10174
> >>698
> >>9
> >>9e64476a53d80d5825ba5053c86f396bfdee07663515)
> >>________________________________________
> >>From: Samuel Ghinet
> >>Sent: Thursday, August 07, 2014 8:39 PM
> >>To: Saurabh Shah
> >>Cc: dev at openvswitch.org
> >>Subject: RE: Cloning packets for "action: set field"
> >>
> >>Hello Saurabh,
> >>
> >>[QUOTE]For the set case, we actually send out the packet to all the
> >>existing output ports before going ahead with the set. This is what
> >>the function OvsOutputBeforeSetAction does. So, at any given point we
> >>only have one NBL to deal with.[/QUOTE] This may still not solve the
> >>problem: when the NdisFSendNetBufferLists function is called, the
> >>packet is being forwarded to the underlying drivers, and thus may
> >>spend some time until it is actually sent out. I believe this is an
> >>asynchronous operation (meaning that you just request a "send packet
> >>to other drivers"  and may move forward before all is finished). I say
> >>this because I remember in my tests, it sometimes happens for packets
> >>to come in FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and
> >>NBL4, and only after NBL4 was finished being sent via
> >>NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to be
> >>called.
> >>
> >>[QUOTE]We actually can copy parts of the header (even if they are in a
> >>single MDL), in fact our code has the infrastructure to do this.
> >>Please take a look at OvsPartialCopyNBL.[/QUOTE] I believe you
> >>missunderstand me: I was speaking about how
> >>NdisAllocateCloneNetBufferList and NdisRetreatNetBufferDataStart work.
> >>Both of them are being used by OvsPartialCopyNBL and the problem is
> >>not solved.
> >>The point with NdisAllocateCloneNetBufferList:
> >>"The clone NET_BUFFER_LIST structure describes the same data that is
> >>described by the NET_BUFFER_LIST structure at OriginalNetBufferList.
> >>NDIS does not copy the data that is described by the original MDLs to
> >>new data buffers. Instead, the cloned structures reference the
> >>original data buffers. "
> >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com
> /
> >>en-
> >>u
> >>s/library/windows/hardware/ff560705%28v%3Dvs.85%29.aspx&k=oIvRg1
> %2BdGA
> >>gOo
> >>M
> >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNh
> CUc0E%3D%0A&
> >>m=O
> >>0
> >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=f85e9e
> a431058676
> >>3be
> >>f
> >>feb77684b3e99a81a8f4a0cb5eadfb4298ee124d7d4f)
> >>This means that by writing to the clone buffer you actually write to
> >>the original buffer. The buffer is REFERENCED, not a new one allocated.
> >>
> >>The point with NdisRetreatNetBufferDataStart:
> >>"If there isn't enough unused data space, this function allocates a
> >>new buffer and an MDL to describe the new buffer and chains the new
> >>MDL to the beginning of the MDL chain. "
> >>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com
> /
> >>en-
> >>u
> >>s/library/windows/hardware/ff564527%28v%3Dvs.85%29.aspx&k=oIvRg1
> %2BdGA
> >>gOo
> >>M
> >>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNh
> CUc0E%3D%0A&
> >>m=O
> >>0
> >>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=6cd96c
> ceddd38b61
> >>d06
> >>4
> >>fdc3c602d2b0e4e4d8c0f4629ff45f16361d29cf8418)
> >>This means that if you have an NB with data offset = 40, you retreat
> >>by
> >>40 and write 40 bytes from the "new beginning" of the buffer, you
> >>actually write to the same MDL.
> >>
> >>[QUOTE]The Buffer Management subsystem is one of the major
> difference
> >>between both the drivers. Since doing a deep copy of all packets is
> >>pretty sub-optimal, we (Guolin, in particular) designed the
> >>sophisticated buffer management system. Yes, it is intricate at first,
> >>but we already have it! (And for a while now. :))[/QUOTE] I
> >>particularly find it very intricate. Perhaps it should be refactored a
> >>bit and documented a bit more.
> >>For instance, functions such as OvsPartialCopyNBL and
> >>OvsPartialCopyToMultipleNBLs and OvsCopySinglePacketNBL and
> >>OvsFullCopyNBL and OvsFullCopyToMultipleNBLs and OvsCompleteNBL I
> find
> >>very daunting.
> >>Perhaps some comments for several parameters used and some
> comments in
> >>the function body would help as well.
> >>What does "Partial copy" actually mean? I understand full copy should
> >>mean something like "deep copy" / duplicate, but partial copy means
> >>create new NBL and new NB, and copy src buffer -> dst buffer ( = src
> >>buffer, because it's the same buffer)??
> >>
> >>IMPORTANT:
> >>I have just made a test with your functionality from
> >>OvsPartialCopyNBL, and checked the cloned NBL against original NBL:
> >>original NBL: 0xffffe000`02e74b20
> >>cloned NBL: 0xffffe000`02a53ac0 (ok)
> >>
> >>NBL's FirstNetBuffer
> >>original FirstNetBuffer: 0xffffe000`02e74c80; DataOffset = 0x26;
> >>DataLength = 0x5c cloned FirstNetBuffer: 0xffffe000`02fe4890;
> >>DataOffset = 0; DataLength = 0x5c (ok)
> >>
> >>CurrentMdl of the FirstNetBuffer
> >>original CurrentMdl: 0xffffe000`02e74da0; ByteCount = 0x50; (used
> >>space starting after 0x26 bytes, see above); Next MDL count bytes =
> >>0x32; (0x50
> >>- 0x26) + 0x32 = 0x5c
> >>cloned CurrentMdl: 0xffffe000`02e74dc6; ByteCount = 0x2a; Next MDL
> >>count bytes = 0x32; total size from MDLs = 0x5c (NOT OK -- see below)
> >>
> >>If you study CurrentMdl of both, you will see:
> >>original: is spanning from ...74da0 -> ...74DF0 (i.e. + 0x50), but
> >>excluding the unused area: (offset = 0x26): 74DC6 -> 74DF0
> >>cloned: is spanning from ...74DC6 -> ...74DF0  (i.e. + 0x32).
> >>BOTH MDLS MAP THE SAME MEMORY AREA! So writing to cloneNb also
> writes
> >>to originalNb!
> >>
> >>Did you study the layout of NBLs, NBs and MDLs and buffers in your
> >>memory management functions?
> >>
> >>Sam
> >>
> >>________________________________________
> >>From: Saurabh Shah [ssaurabh at vmware.com]
> >>Sent: Wednesday, August 06, 2014 12:50 AM
> >>To: Samuel Ghinet
> >>Cc: dev at openvswitch.org
> >>Subject: RE: Cloning packets for "action: set field"
> >>
> >>Hi Samuel,
> >>
> >>> -----Original Message-----
> >>> From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
> >>> Sent: Tuesday, August 05, 2014 5:00 AM
> >>> To: Saurabh Shah
> >>> Cc: dev at openvswitch.org
> >>> Subject: RE: Cloning packets for "action: set field"
> >>>
> >>> Hello Saurabh,
> >>>
> >>> I believe there are more complexities associated to this packet
> >>>cloning.
> >>> Consider for instance the following flow:
> >>> if (conditions on packet) =>
> >>> out_to_port: vm1;
> >>> set ip_dest: X1; out_to_port: vm2
> >>> set ip_dest: X2; out_to_port: vm3
> >>>
> >>> Given that multiple destinations are set, and only then the packet
> >>>is  outputted, then, given the issue with cloning, all packets (that
> >>>will be  outputted to vm1, vm2, vm3) will have the same, last
> >>>destination ip.
> >>>
> >>
> >>For the set case, we actually send out the packet to all the existing
> >>output ports before going ahead with the set. This is what the
> >>function OvsOutputBeforeSetAction does. So, at any given point we only
> >>have one NBL to deal with.
> >>
> >>> I believe we cannot actually copy only the headers: if the packet
> >>>contains a  single MDL, there is nothing else we can do but to
> >>>duplicate the packet.
> >>> For the case where the NB has multiple MDLs chained, we must copy
> >>>the first;  MDL(s), i.e. the MDL or MDLs that contain the headers
> >>>that must be copied.
> >>>
> >>
> >>We actually can copy parts of the header (even if they are in a single
> >>MDL), in fact our code has the infrastructure to do this. Please take
> >>a look at OvsPartialCopyNBL.
> >>
> >>> There is also a problem with NBL completion (say we do the MDL copy
> >>>and
> >>> replacement):
> >>> for the case above, consider:
> >>> out_to_port: vm1;  (NB1)
> >>> set ip_dest: X1; out_to_port: vm2 (clone to NB2, create new MDL,
> >>>store ptr  to original MDL) set ip_dest: X2; out_to_port: vm3(clone
> >>>to NB3, create new  MDL, store ptr to original MDL)
> >>>
> >>> At completion, we may need to do this:
> >>> NB3: Put back old MDL into original NB, free clone NB3
> >>> NB2: Put back old MDL (but already done before), and free NB2
> >>> NB1: it's the original, so nothing special.
> >>>
> >>> In order to implement this "but already done before" case, we need
> >>>to use  ref counting for NBL context data (which I have seen you do
> >>>have some use  of ref counting) However, consider the complexity of
> >>>restoring an original NB  when multiple MDLs have been replaced, by
> >>>multiple clones. The big  problem here is that the whole MDL link
> >>>would be broken.
> >>>
> >>
> >>The Buffer Management subsystem is one of the major difference
> between
> >>both the drivers. Since doing a deep copy of all packets is pretty
> >>sub-optimal, we (Guolin, in particular) designed the sophisticated
> >>buffer management system. Yes, it is intricate at first, but we already have
> it!
> >>(And for a while now. :)) So, all that you have mentioned here has
> >>already been implemented. Please take a look at our Buffer Management
> >>code & let me know if something is still unclear.
> >>
> >>> I strongly believe that building this component with the mind set
> >>>"optimize  first, fix bugs later" is not at all a good way to do it.
> >>>Plus, it would make the  code more and more intrincate, as we
> >>>continue to both find new problems  and we get to have large pieces
> >>>of code (dealing with NBLs) spread all over.
> >>> I personally believe a better approach would be "start with
> >>>something  simpler, which is fully functional even though suboptimal,
> >>>and then continue  by building on top of it, improving incrementally
> >>>for each of various cases".
> >>>
> >>
> >>Quite the contrary. The whole point of implementing the buffer
> >>management code was to layout the right infrastructure to help
> >>simplify NBL management for all the consumers & avoid doing a major
> >>enhancement at a later point. I can go off on a rant explaining our
> >>rationales in detail, but to allay your concerns, let me assure you
> >>that we don't have a mindset of optimize-over-bugfix. Once the Netlink
> >>integration is complete, I expect all of us to work towards closing
> >>all critical bugs at a priority.
> >>
> >>> Thanks!
> >>> Sam
> >>
> >>HTH,
> >>Saurabh




More information about the dev mailing list