[ovs-dev] Cloning packets for "action: set field"

Samuel Ghinet sghinet at cloudbasesolutions.com
Tue Aug 26 17:40:19 UTC 2014


Hi Saurabh,

That may not be a very easy answer, but, a better way would be, I believe:
headRoom -> encapBytes
copySize -> headersSize
copySize + headRoom -> (new variable) newMdlSize

It would also be very useful to have documentation comments on the parameters, e.g.
/*
* headersSize:    the number of bytes from the beginning of the packet buffer to be copied to a new MDL.
*                       The original NBL will be advanced by headersSize, after which the NBL will be cloned, causing
*                       the cloned NBL's NB to have DataOffset = 0. This will make a Retreat operation on the cloned NBL
*                       to allocate a new MDL. This allows us to modify the packet buffer headers on the clone later on,
*                       while preserving the original NBL unmodified.
* encapBytes:     additional bytes, which will be used for prepending the encapsulation headers (e.g. VXLAN + ipv4 + eth)
*/

/*
* newMdlSize:    the size of the new MDL that will be allocated for the cloned NBL. It takes into account the size of the
*                       packet headers (if some header field needs to be modified), and the additional bytes required for encapsulation.
*/

This is a suggestion.
As for the documentary comments suggestion, please feel free to alter it in any way you think it would make code clearer.

Thanks,
Sam
     
________________________________________
From: Saurabh Shah [ssaurabh at vmware.com]
Sent: Tuesday, August 26, 2014 1:30 AM
To: Samuel Ghinet
Cc: dev at openvswitch.org
Subject: Re: Cloning packets for "action: set field"

Hi Samuel,

>Hello Saurabh,
>
>Thanks for the clarifications!
>The param name "copySize" was very ambiguous.

What name do you think will make it easy to understand? I can take this in
to account when I post a patch for -
https://github.com/openvswitch/ovs-issues/issues/28.

>
>I did not debug this exact scenario, but you must be right:
>- Cloning an original NB that has DataOffset > 0 causes the clone to have
>DataOffset = 0
>- NdisRetreatNetBufferDataStart doc says
>"
>DataOffsetDelta [in]
>
>    The amount of used data space to add. NDIS adjusts the DataOffset
>member of the NET_BUFFER structure accordingly. If there is not enough
>unused data space to satisfy the request, NDIS allocates additional
>memory.
>"
>Therefore, a retreat on the clone must allocate a new MDL. The
>NdisCopyFromNetBufferToNetBuffer, then, makes sense.
>
>And how do you do when you need to, say, set dest ipv4 addr, then set src
>ipv4 addr, then out to ports, for the same packet? Do you make two
>clones, one for each set action?

Yes, we would clone from the first clone.

>
>And if you need to:
>clone to clone1, set ipv4 dest addr1, out to port 1;
>then
>clone to clone2, set ipv4 dest addr2, out to port 2;
>clone2 will be a clone of clone1?

Yes.

>
>Thanks!
>Sam
>________________________________________
>From: Saurabh Shah [ssaurabh at vmware.com]
>Sent: Friday, August 15, 2014 8:03 AM
>To: Samuel Ghinet
>Cc: dev at openvswitch.org
>Subject: Re: Cloning packets for "action: set field"
>
>Hi Samuel,
>
>Thanks for your patience. I believe all of your concerns are arising due
>to a fundamental misunderstanding of the NDIS apis & the buffer management
>code. I thought it would useful if I walk you through the
>OvsPartialCopyNBL logic by giving an example.
>
>Let's say there is a call to OvsPartialCopyNBL(), with copySize = 50. This
>is presumably the size of the header that you want to be able to modify
>it.
>
>Say, the original NBL looks like following:
>
>Original NBL
>------------
>        |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
>             |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
>MappedSystemVa = 0x1000, Next = NULL]
>
>
>So, here is how the partial copy logic works -
>
>
>01. Inside partial copy API, first we advance the original NBL by copySize
>(50 bytes).
>
>After this operation the original NBL looks like following:
>
>Original NBL
>------------
>        |-> NB1 [DataLength = 1450, CurrentMdlOffset = 50, Next = NULL]
>             |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
>MappedSystemVa = 0x1000, Next = NULL]
>
>Note: CurrentMdlOffset & DataLength get modified.
>
>02. Allocate a clone from the Original NBL.
>
>The original NBL is unchanged & the cloned NBL looks like following:
>
>
>
>Clone NBL
>---------
>   |-> NB1 [DataLength = 1450, CurrentMdlOffset = 0, Next = NULL]
>        |-> MdlChain --> MDL2 [ByteCount = 1450, ByteOffset = 50,
>MappedSystemVa
>= 0x1000, Next = NULL]
>
>Note: The cloned NBL has a new MDL shell (i.e MDL2), but is pointing to
>the same data buffer VA:0x1000. The ByteOffset is appropriately adjusted
>in MDL2 (by 50 bytes).
>
>03. The original NBL gets retreated by copySize.
>
>With this operation, the original NBL becomes *exactly* like how the
>caller had given it.
>
>Original NBL
>------------
>        |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
>             |-> MdlChain --> MDL1 [ByteCount = 1500, ByteOffset = 0,
>MappedSystemVa = 0x1000, Next = NULL]
>
>
>04. The cloned NBL is retreated by copySize bytes (assume headRoom = 0).
>
>The clone NBL will look like following:
>
>Clone NBL
>---------
>   |-> NB1 [DataLength = 1500, CurrentMdlOffset = 0, Next = NULL]
>        |-> MdlChain --> MDL3 [ByteCount = 50, ByteOffset = 0,
>MappedSystemVa =
>0x2000, Next = MDL2]
>                           |--> MDL2 [ByteCount = 1450, ByteOffset = 50,
>MappedSystemVa =
>0x1000, Next = NULL]
>
>
>
>Note: A new memory descriptor (MDL3) is added in front of the chain and
>the new descriptor points to a completely new data buffer, VA:0x2000.
>
>05. We copy ŒcopySize¹ bytes from original NBL to the head of the cloned
>NBL. For this we use the ndis call, NdisCopyFromNetBufferToNetBuffer. NDIS
>api takes care that 'copySize¹ bytes get copied correctly. This means that
>even if the bytes where split in multiple MDLs they will correctly be
>copied to a contiguous new data buffer of the cloned NBL.
>
>Note: At the end of this step, the cloned NBL is guaranteed to have
>ŒcopySize¹ amount of data in a contiguous data buffer. And the data buffer
>is solely owned by the cloned NBL.
>
>06. Allocate NBL context for the cloned NBL, set appropriate flags in the
>NBL context & take a ref on the parent NBL context.
>
>To summarize, when partial copy returns, the caller gets a new NBL with
>copySize amount of data in a contiguous *writeable* buffer.
>
>
>Completion Path:
>----------------
>
>For most operations that require you to modify the packet, we would have
>done a partial copy at some point. After manipulating the partially copied
>NBL, the packet gets sent to the appropriate port. Completion is triggered
>when NDIS calls our 'FilterSendNetBufferListsComplete' callback. The
>complete callback will go ahead and complete the NBL it receives from
>NDIS. This will trigger a chain of NBL free's, till it reaches the root
>NBL. The root NBL is the original NBL that the extension received in the
>ŒFilterSendNetBufferLists¹ callback. When the refcount of the root NBL
>reaches 0, we propagate the completion by calling
>NdisFSendNetBufferListsComplete for all those root NBLs whose refcount
>have become 0.
>
>
>In theory this is how things are supposed to work. If you still think that
>this approach has holes, then let me know and we can do a separate IRC
>session just to discuss this. :)
>
>Thanks,
>Saurabh
>
>
>From:  Samuel Ghinet <sghinet at cloudbasesolutions.com>
>Date:  Thursday, August 7, 2014 at 10:45 AM
>To:  Saurabh Shah <ssaurabh at vmware.com>
>Cc:  "dev at openvswitch.org" <dev at openvswitch.org>
>Subject:  RE: Cloning packets for "action: set field"
>
>
>>"it sometimes happens for packets to come in FilterSentNetBufferLists
>>callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished
>>being sent
>>via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to
>>be called."
>>
>>I mean, to enter the FilterSentNetBufferLists callback 4 times, each time
>>for a NBL. And only after the 4th was done, the
>>FilterSentNetBufferListsComplete callback to start being called.
>>
>>This may mean that it is possible that:
>>1) cloning NBL1 into NBL2
>>2) call NdisFSendNetBufferLists on NBL1
>>3) modify NBL2
>>may still corrupt the buffer of NBL1.
>>
>>Reading from MSDN:
>>"As soon as a filter driver calls the NdisFSendNetBufferLists function,
>>it relinquishes ownership of the NET_BUFFER_LIST structures and all
>>associated resources. "
>>
>>"Until NDIS calls FilterSendNetBufferListsComplete, the current status of
>>the send request is not available to the filter driver. A filter driver
>>temporarily releases ownership of all resources that are associated with
>>a send request when it calls NdisFSendNetBufferLists. A filter driver
>>should never try to examine the NET_BUFFER_LIST structures or any
>>associated data after calling NdisFSendNetBufferLists."
>>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en-
>>u
>>s/library/windows/hardware/ff562616%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOo
>>M
>>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O
>>0
>>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=891f0324d2d10174698
>>9
>>9e64476a53d80d5825ba5053c86f396bfdee07663515)
>>________________________________________
>>From: Samuel Ghinet
>>Sent: Thursday, August 07, 2014 8:39 PM
>>To: Saurabh Shah
>>Cc: dev at openvswitch.org
>>Subject: RE: Cloning packets for "action: set field"
>>
>>Hello Saurabh,
>>
>>[QUOTE]For the set case, we actually send out the packet to all the
>>existing output ports before going ahead with the set. This is what the
>>function OvsOutputBeforeSetAction does. So, at any given point we only
>>have one NBL to deal with.[/QUOTE]
>>This may still not solve the problem: when the NdisFSendNetBufferLists
>>function is called, the packet is being forwarded to the underlying
>>drivers, and thus may spend some time until it is actually sent out. I
>>believe this is an asynchronous operation
>>(meaning that you just request a "send packet to other drivers"  and may
>>move forward before all is finished). I say this because I remember in my
>>tests, it sometimes happens for packets to come in
>>FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and NBL4, and only
>>after NBL4 was finished being sent
>>via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to
>>be called.
>>
>>[QUOTE]We actually can copy parts of the header (even if they are in a
>>single MDL), in fact our code has the infrastructure to do this. Please
>>take a look at OvsPartialCopyNBL.[/QUOTE]
>>I believe you missunderstand me: I was speaking about how
>>NdisAllocateCloneNetBufferList and NdisRetreatNetBufferDataStart work.
>>Both of them are being used by OvsPartialCopyNBL and the problem is not
>>solved.
>>The point with NdisAllocateCloneNetBufferList:
>>"The clone NET_BUFFER_LIST structure describes the same data that is
>>described by the NET_BUFFER_LIST structure at OriginalNetBufferList. NDIS
>>does not copy the data that is described by the original MDLs to new data
>>buffers. Instead, the cloned structures reference the original data
>>buffers. "
>>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en-
>>u
>>s/library/windows/hardware/ff560705%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOo
>>M
>>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O
>>0
>>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=f85e9ea4310586763be
>>f
>>feb77684b3e99a81a8f4a0cb5eadfb4298ee124d7d4f)
>>This means that by writing to the clone buffer you actually write to the
>>original buffer. The buffer is REFERENCED, not a new one allocated.
>>
>>The point with NdisRetreatNetBufferDataStart:
>>"If there isn't enough unused data space, this function allocates a new
>>buffer and an MDL to describe the new buffer and chains the new MDL to
>>the beginning of the MDL chain. "
>>(https://urldefense.proofpoint.com/v1/url?u=http://msdn.microsoft.com/en-
>>u
>>s/library/windows/hardware/ff564527%28v%3Dvs.85%29.aspx&k=oIvRg1%2BdGAgOo
>>M
>>1BIlLLqw%3D%3D%0A&r=pEkjsHfytvHEWufeZPpgqSOJMdMjuZPbesVsNhCUc0E%3D%0A&m=O
>>0
>>y%2BQrr%2FNHgPJBj7fBjzyZ1IpChLYg49C9nKwT7y0M4%3D%0A&s=6cd96cceddd38b61d06
>>4
>>fdc3c602d2b0e4e4d8c0f4629ff45f16361d29cf8418)
>>This means that if you have an NB with data offset = 40, you retreat by
>>40 and write 40 bytes from the "new beginning" of the buffer, you
>>actually write to the same MDL.
>>
>>[QUOTE]The Buffer Management subsystem is one of the major difference
>>between both the drivers. Since doing a deep copy of all packets is
>>pretty sub-optimal, we (Guolin, in particular) designed the sophisticated
>>buffer management system. Yes, it is intricate at first, but we already
>>have it! (And for a while now. :))[/QUOTE]
>>I particularly find it very intricate. Perhaps it should be refactored a
>>bit and documented a bit more.
>>For instance, functions such as OvsPartialCopyNBL and
>>OvsPartialCopyToMultipleNBLs and OvsCopySinglePacketNBL and
>>OvsFullCopyNBL and OvsFullCopyToMultipleNBLs and OvsCompleteNBL I find
>>very daunting.
>>Perhaps some comments for several parameters used and some comments in
>>the function body would help as well.
>>What does "Partial copy" actually mean? I understand full copy should
>>mean something like "deep copy" / duplicate, but partial copy means
>>create new NBL and new NB, and copy src buffer -> dst buffer ( = src
>>buffer, because it's the same buffer)??
>>
>>IMPORTANT:
>>I have just made a test with your functionality from OvsPartialCopyNBL,
>>and checked the cloned NBL against original NBL:
>>original NBL: 0xffffe000`02e74b20
>>cloned NBL: 0xffffe000`02a53ac0 (ok)
>>
>>NBL's FirstNetBuffer
>>original FirstNetBuffer: 0xffffe000`02e74c80; DataOffset = 0x26;
>>DataLength = 0x5c
>>cloned FirstNetBuffer: 0xffffe000`02fe4890; DataOffset = 0; DataLength =
>>0x5c (ok)
>>
>>CurrentMdl of the FirstNetBuffer
>>original CurrentMdl: 0xffffe000`02e74da0; ByteCount = 0x50; (used space
>>starting after 0x26 bytes, see above); Next MDL count bytes = 0x32; (0x50
>>- 0x26) + 0x32 = 0x5c
>>cloned CurrentMdl: 0xffffe000`02e74dc6; ByteCount = 0x2a; Next MDL count
>>bytes = 0x32; total size from MDLs = 0x5c (NOT OK -- see below)
>>
>>If you study CurrentMdl of both, you will see:
>>original: is spanning from ...74da0 -> ...74DF0 (i.e. + 0x50), but
>>excluding the unused area: (offset = 0x26): 74DC6 -> 74DF0
>>cloned: is spanning from ...74DC6 -> ...74DF0  (i.e. + 0x32).
>>BOTH MDLS MAP THE SAME MEMORY AREA! So writing to cloneNb also writes to
>>originalNb!
>>
>>Did you study the layout of NBLs, NBs and MDLs and buffers in your memory
>>management functions?
>>
>>Sam
>>
>>________________________________________
>>From: Saurabh Shah [ssaurabh at vmware.com]
>>Sent: Wednesday, August 06, 2014 12:50 AM
>>To: Samuel Ghinet
>>Cc: dev at openvswitch.org
>>Subject: RE: Cloning packets for "action: set field"
>>
>>Hi Samuel,
>>
>>> -----Original Message-----
>>> From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
>>> Sent: Tuesday, August 05, 2014 5:00 AM
>>> To: Saurabh Shah
>>> Cc: dev at openvswitch.org
>>> Subject: RE: Cloning packets for "action: set field"
>>>
>>> Hello Saurabh,
>>>
>>> I believe there are more complexities associated to this packet
>>>cloning.
>>> Consider for instance the following flow:
>>> if (conditions on packet) =>
>>> out_to_port: vm1;
>>> set ip_dest: X1; out_to_port: vm2
>>> set ip_dest: X2; out_to_port: vm3
>>>
>>> Given that multiple destinations are set, and only then the packet is
>>> outputted, then, given the issue with cloning, all packets (that will
>>>be
>>> outputted to vm1, vm2, vm3) will have the same, last destination ip.
>>>
>>
>>For the set case, we actually send out the packet to all the existing
>>output ports before going ahead with the set. This is what the function
>>OvsOutputBeforeSetAction does. So, at any given point we only have one
>>NBL to deal with.
>>
>>> I believe we cannot actually copy only the headers: if the packet
>>>contains a
>>> single MDL, there is nothing else we can do but to duplicate the
>>>packet.
>>> For the case where the NB has multiple MDLs chained, we must copy the
>>>first;
>>> MDL(s), i.e. the MDL or MDLs that contain the headers that must be
>>>copied.
>>>
>>
>>We actually can copy parts of the header (even if they are in a single
>>MDL), in fact our code has the infrastructure to do this. Please take a
>>look at OvsPartialCopyNBL.
>>
>>> There is also a problem with NBL completion (say we do the MDL copy and
>>> replacement):
>>> for the case above, consider:
>>> out_to_port: vm1;  (NB1)
>>> set ip_dest: X1; out_to_port: vm2 (clone to NB2, create new MDL, store
>>>ptr
>>> to original MDL) set ip_dest: X2; out_to_port: vm3(clone to NB3, create
>>>new
>>> MDL, store ptr to original MDL)
>>>
>>> At completion, we may need to do this:
>>> NB3: Put back old MDL into original NB, free clone NB3
>>> NB2: Put back old MDL (but already done before), and free NB2
>>> NB1: it's the original, so nothing special.
>>>
>>> In order to implement this "but already done before" case, we need to
>>>use
>>> ref counting for NBL context data (which I have seen you do have some
>>>use
>>> of ref counting) However, consider the complexity of restoring an
>>>original NB
>>> when multiple MDLs have been replaced, by multiple clones. The big
>>> problem here is that the whole MDL link would be broken.
>>>
>>
>>The Buffer Management subsystem is one of the major difference between
>>both the drivers. Since doing a deep copy of all packets is pretty
>>sub-optimal, we (Guolin, in particular) designed the sophisticated buffer
>>management system. Yes, it is intricate at first, but we already have it!
>>(And for a while now. :)) So, all that you have mentioned here has
>>already been implemented. Please take a look at our Buffer Management
>>code & let me know if something is still unclear.
>>
>>> I strongly believe that building this component with the mind set
>>>"optimize
>>> first, fix bugs later" is not at all a good way to do it. Plus, it
>>>would make the
>>> code more and more intrincate, as we continue to both find new problems
>>> and we get to have large pieces of code (dealing with NBLs) spread all
>>>over.
>>> I personally believe a better approach would be "start with something
>>> simpler, which is fully functional even though suboptimal, and then
>>>continue
>>> by building on top of it, improving incrementally for each of various
>>>cases".
>>>
>>
>>Quite the contrary. The whole point of implementing the buffer management
>>code was to layout the right infrastructure to help simplify NBL
>>management for all the consumers & avoid doing a major enhancement at a
>>later point. I can go off on a rant explaining our rationales in detail,
>>but to allay your concerns, let me assure you that we don't have a
>>mindset of optimize-over-bugfix. Once the Netlink integration is
>>complete, I expect all of us to work towards closing all critical bugs at
>>a priority.
>>
>>> Thanks!
>>> Sam
>>
>>HTH,
>>Saurabh




More information about the dev mailing list