[ovs-dev] Cloning packets for "action: set field"

Samuel Ghinet sghinet at cloudbasesolutions.com
Thu Aug 7 17:45:49 UTC 2014


"it sometimes happens for packets to come in FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished being sent
via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to be called."

I mean, to enter the FilterSentNetBufferLists callback 4 times, each time for a NBL. And only after the 4th was done, the FilterSentNetBufferListsComplete callback to start being called.

This may mean that it is possible that:
1) cloning NBL1 into NBL2
2) call NdisFSendNetBufferLists on NBL1
3) modify NBL2
may still corrupt the buffer of NBL1.

Reading from MSDN:
"As soon as a filter driver calls the NdisFSendNetBufferLists function, it relinquishes ownership of the NET_BUFFER_LIST structures and all associated resources. "

"Until NDIS calls FilterSendNetBufferListsComplete, the current status of the send request is not available to the filter driver. A filter driver temporarily releases ownership of all resources that are associated with a send request when it calls NdisFSendNetBufferLists. A filter driver should never try to examine the NET_BUFFER_LIST structures or any associated data after calling NdisFSendNetBufferLists."
(http://msdn.microsoft.com/en-us/library/windows/hardware/ff562616%28v=vs.85%29.aspx)
________________________________________
From: Samuel Ghinet
Sent: Thursday, August 07, 2014 8:39 PM
To: Saurabh Shah
Cc: dev at openvswitch.org
Subject: RE: Cloning packets for "action: set field"

Hello Saurabh,

[QUOTE]For the set case, we actually send out the packet to all the existing output ports before going ahead with the set. This is what the function OvsOutputBeforeSetAction does. So, at any given point we only have one NBL to deal with.[/QUOTE]
This may still not solve the problem: when the NdisFSendNetBufferLists function is called, the packet is being forwarded to the underlying drivers, and thus may spend some time until it is actually sent out. I believe this is an asynchronous operation
(meaning that you just request a "send packet to other drivers"  and may move forward before all is finished). I say this because I remember in my tests, it sometimes happens for packets to come in FilterSentNetBufferLists callback: NBL1, NBL2, NBL3 and NBL4, and only after NBL4 was finished being sent
via NdisFSendNetBufferLists, FilterSentNetBufferListsComplete starts to be called.

[QUOTE]We actually can copy parts of the header (even if they are in a single MDL), in fact our code has the infrastructure to do this. Please take a look at OvsPartialCopyNBL.[/QUOTE]
I believe you missunderstand me: I was speaking about how NdisAllocateCloneNetBufferList and NdisRetreatNetBufferDataStart work. Both of them are being used by OvsPartialCopyNBL and the problem is not solved.
The point with NdisAllocateCloneNetBufferList:
"The clone NET_BUFFER_LIST structure describes the same data that is described by the NET_BUFFER_LIST structure at OriginalNetBufferList. NDIS does not copy the data that is described by the original MDLs to new data buffers. Instead, the cloned structures reference the original data buffers. "
(http://msdn.microsoft.com/en-us/library/windows/hardware/ff560705%28v=vs.85%29.aspx)
This means that by writing to the clone buffer you actually write to the original buffer. The buffer is REFERENCED, not a new one allocated.

The point with NdisRetreatNetBufferDataStart:
"If there isn't enough unused data space, this function allocates a new buffer and an MDL to describe the new buffer and chains the new MDL to the beginning of the MDL chain. "
(http://msdn.microsoft.com/en-us/library/windows/hardware/ff564527%28v=vs.85%29.aspx)
This means that if you have an NB with data offset = 40, you retreat by 40 and write 40 bytes from the "new beginning" of the buffer, you actually write to the same MDL.

[QUOTE]The Buffer Management subsystem is one of the major difference between both the drivers. Since doing a deep copy of all packets is pretty sub-optimal, we (Guolin, in particular) designed the sophisticated buffer management system. Yes, it is intricate at first, but we already have it! (And for a while now. :))[/QUOTE]
I particularly find it very intricate. Perhaps it should be refactored a bit and documented a bit more.
For instance, functions such as OvsPartialCopyNBL and OvsPartialCopyToMultipleNBLs and OvsCopySinglePacketNBL and OvsFullCopyNBL and OvsFullCopyToMultipleNBLs and OvsCompleteNBL I find very daunting.
Perhaps some comments for several parameters used and some comments in the function body would help as well.
What does "Partial copy" actually mean? I understand full copy should mean something like "deep copy" / duplicate, but partial copy means create new NBL and new NB, and copy src buffer -> dst buffer ( = src buffer, because it's the same buffer)??

IMPORTANT:
I have just made a test with your functionality from OvsPartialCopyNBL, and checked the cloned NBL against original NBL:
original NBL: 0xffffe000`02e74b20
cloned NBL: 0xffffe000`02a53ac0 (ok)

NBL's FirstNetBuffer
original FirstNetBuffer: 0xffffe000`02e74c80; DataOffset = 0x26; DataLength = 0x5c
cloned FirstNetBuffer: 0xffffe000`02fe4890; DataOffset = 0; DataLength = 0x5c (ok)

CurrentMdl of the FirstNetBuffer
original CurrentMdl: 0xffffe000`02e74da0; ByteCount = 0x50; (used space starting after 0x26 bytes, see above); Next MDL count bytes = 0x32; (0x50 - 0x26) + 0x32 = 0x5c
cloned CurrentMdl: 0xffffe000`02e74dc6; ByteCount = 0x2a; Next MDL count bytes = 0x32; total size from MDLs = 0x5c (NOT OK -- see below)

If you study CurrentMdl of both, you will see:
original: is spanning from ...74da0 -> ...74DF0 (i.e. + 0x50), but excluding the unused area: (offset = 0x26): 74DC6 -> 74DF0
cloned: is spanning from ...74DC6 -> ...74DF0  (i.e. + 0x32).
BOTH MDLS MAP THE SAME MEMORY AREA! So writing to cloneNb also writes to originalNb!

Did you study the layout of NBLs, NBs and MDLs and buffers in your memory management functions?

Sam

________________________________________
From: Saurabh Shah [ssaurabh at vmware.com]
Sent: Wednesday, August 06, 2014 12:50 AM
To: Samuel Ghinet
Cc: dev at openvswitch.org
Subject: RE: Cloning packets for "action: set field"

Hi Samuel,

> -----Original Message-----
> From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
> Sent: Tuesday, August 05, 2014 5:00 AM
> To: Saurabh Shah
> Cc: dev at openvswitch.org
> Subject: RE: Cloning packets for "action: set field"
>
> Hello Saurabh,
>
> I believe there are more complexities associated to this packet cloning.
> Consider for instance the following flow:
> if (conditions on packet) =>
> out_to_port: vm1;
> set ip_dest: X1; out_to_port: vm2
> set ip_dest: X2; out_to_port: vm3
>
> Given that multiple destinations are set, and only then the packet is
> outputted, then, given the issue with cloning, all packets (that will be
> outputted to vm1, vm2, vm3) will have the same, last destination ip.
>

For the set case, we actually send out the packet to all the existing output ports before going ahead with the set. This is what the function OvsOutputBeforeSetAction does. So, at any given point we only have one NBL to deal with.

> I believe we cannot actually copy only the headers: if the packet contains a
> single MDL, there is nothing else we can do but to duplicate the packet.
> For the case where the NB has multiple MDLs chained, we must copy the first;
> MDL(s), i.e. the MDL or MDLs that contain the headers that must be copied.
>

We actually can copy parts of the header (even if they are in a single MDL), in fact our code has the infrastructure to do this. Please take a look at OvsPartialCopyNBL.

> There is also a problem with NBL completion (say we do the MDL copy and
> replacement):
> for the case above, consider:
> out_to_port: vm1;  (NB1)
> set ip_dest: X1; out_to_port: vm2 (clone to NB2, create new MDL, store ptr
> to original MDL) set ip_dest: X2; out_to_port: vm3(clone to NB3, create new
> MDL, store ptr to original MDL)
>
> At completion, we may need to do this:
> NB3: Put back old MDL into original NB, free clone NB3
> NB2: Put back old MDL (but already done before), and free NB2
> NB1: it's the original, so nothing special.
>
> In order to implement this "but already done before" case, we need to use
> ref counting for NBL context data (which I have seen you do have some use
> of ref counting) However, consider the complexity of restoring an original NB
> when multiple MDLs have been replaced, by multiple clones. The big
> problem here is that the whole MDL link would be broken.
>

The Buffer Management subsystem is one of the major difference between both the drivers. Since doing a deep copy of all packets is pretty sub-optimal, we (Guolin, in particular) designed the sophisticated buffer management system. Yes, it is intricate at first, but we already have it! (And for a while now. :)) So, all that you have mentioned here has already been implemented. Please take a look at our Buffer Management code & let me know if something is still unclear.

> I strongly believe that building this component with the mind set "optimize
> first, fix bugs later" is not at all a good way to do it. Plus, it would make the
> code more and more intrincate, as we continue to both find new problems
> and we get to have large pieces of code (dealing with NBLs) spread all over.
> I personally believe a better approach would be "start with something
> simpler, which is fully functional even though suboptimal, and then continue
> by building on top of it, improving incrementally for each of various cases".
>

Quite the contrary. The whole point of implementing the buffer management code was to layout the right infrastructure to help simplify NBL management for all the consumers & avoid doing a major enhancement at a later point. I can go off on a rant explaining our rationales in detail, but to allay your concerns, let me assure you that we don't have a mindset of optimize-over-bugfix. Once the Netlink integration is complete, I expect all of us to work towards closing all critical bugs at a priority.

> Thanks!
> Sam

HTH,
Saurabh



More information about the dev mailing list