[ovs-dev] [PATCH V3] netdev-dpdk: Add Jumbo Frame Support.
Kavanagh, Mark B
mark.b.kavanagh at intel.com
Thu Feb 11 10:46:40 UTC 2016
>
>
>On 08/02/2016 03:23, "Kavanagh, Mark B" <mark.b.kavanagh at intel.com> wrote:
>
>>>
>>>Hi Mark,
>>>
>>
>>Hi Daniele,
>>
>>Thanks for the review! Responses inline below.
>>
>>Cheers,
>>Mark
>>
>>>This patch besides adding Jumbo Frame support also cleans up
>>>the mbuf initialization (by changing the macros, adding
>>>dpdk_buf_size, and rewriting __ovs_rte_pktmbuf_init), so thanks
>>>for this. I think it makes sense to split the patch in two:
>>>one that does the clenup, and one that allows configuring the
>>>MTU.
>>
>>Okay, sounds good - I'll spin another version, splitting into a two patch
>>set.
>
>Thanks!
>
>>
>>>
>>>I agree with Flavio's comments as well, more inline
>>>
>>>Thanks
>>>
>>>On 01/02/2016 02:18, "Mark Kavanagh" <mark.b.kavanagh at intel.com> wrote:
>>>
>>>>Add support for Jumbo Frames to DPDK-enabled port types,
>>>>using single-segment-mbufs.
>>>>
>>>>Using this approach, the amount of memory allocated for each mbuf
>>>>to store frame data is increased to a value greater than 1518B
>>>>(typical Ethernet maximum frame length). The increased space
>>>>available in the mbuf means that an entire Jumbo Frame can be carried
>>>>in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>segments.
>>>>
>>>>The amount of space allocated to each mbuf to hold frame data is
>>>>defined dynamically by the user when adding a DPDK port to a bridge.
>>>>If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>
>>>>Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>>>---
>>>> INSTALL.DPDK.md | 59 +++++++++-
>>>> NEWS | 2 +
>>>> lib/netdev-dpdk.c | 347
>>>>+++++++++++++++++++++++++++++++++++++++++-------------
>>>> 3 files changed, 328 insertions(+), 80 deletions(-)
>>>>
>>>>diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>index 96b686c..64ccd15 100644
>>>>--- a/INSTALL.DPDK.md
>>>>+++ b/INSTALL.DPDK.md
>>>>@@ -859,10 +859,61 @@ by adding the following string:
>>>> to <interface> sections of all network devices used by DPDK. Parameter
>>>>'N'
>>>> determines how many queues can be used by the guest.
>>>>
>>>>+
>>>>+Jumbo Frames
>>>>+------------
>>>>+
>>>>+Support for Jumbo Frames may be enabled at run-time for DPDK-type
>>>>ports.
>>>>+
>>>>+To avail of Jumbo Frame support, add the 'dpdk-mtu' option to the
>>>>ovs-vsctl
>>>>+'add-port' command-line, along with the required MTU for the port.
>>>>+e.g.
>>>>+
>>>>+ ```
>>>>+ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>>options:dpdk-mtu=9000
>>>>+ ```
>>>>+
>>>>+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments
>>>>are
>>>>+increased, such that a full Jumbo Frame may be accommodated inside a
>>>>single
>>>>+mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>
>>>Why is it immutable?
>>
>>I guess my rationale here is that an MTU change can't be triggered via
>>OVS command-line, nor can it be triggered programmatically via DPDK
>>(apart from an explicit call to rte_eth_dev_set_mtu).
>>So, while technically it's possibly, from a user's point of view, there's
>>no way to configure it, outside of modifying the code directly.
>>If I've missed something here, please let me know.
>
>With
>
>ovs-vsctl set Interface options:dpdk-mtu=...
>
>you could change it at runtime. Have you considered that in this patch?
>I guess it's ok to have it configurable only when the port is added for
>now,
I'd actually only considered 'ovs-vsctl set Interface dpdk0 mtu XYZ'.
Since the MTU field in the Interface table is a status field (i.e. essentially read-only), updating the MTU in this way wouldn't have been possible.
Unfortunately, the same can't be said for the 'options' field - I'll need to look at this before I push a new version of the patch.
Thanks for the catch!
>but we need to make sure that nothing bad happens if a user tries to change
>it when it's running.
>
>Also, after talking with other people, I think that a better name for the
>parameter might be 'mtu_request',
No problem, I'll rename it.
and we might want to use it also for
>kernel
>ports (but that doesn't need to be done now).
>
>
>>
>>>
>>>>+
>>>>+Jumbo frame support has been validated against 13312B frames, using the
>>>>+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>+theoretically be supported. Supported port types excludes vHost-Cuse
>>>>ports, as
>>>>+this feature is pending deprecation.
>>>>+
>>>>+
>>>>+vHost Ports and Jumbo Frames
>>>>+----------------------------
>>>>+Jumbo frame support is available for DPDK vHost-User ports only. Some
>>>>additional
>>>>+configuration is needed to take advantage of this feature:
>>>>+
>>>>+ 1. `mergeable buffers` must be enabled for vHost ports, as
>>>>demonstrated in
>>>>+ the QEMU command line snippet below:
>>>>+
>>>>+ ```
>>>>+ '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>+ '-device
>>>>virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>+ ```
>>>>+
>>>>+ 2. Where virtio devices are bound to the Linux kernel driver in a
>>>>guest
>>>>+ environment (i.e. interfaces are not bound to an in-guest DPDK
>>>>driver), the
>>>>+ MTU of those logical network interfaces must also be increased.
>>>>This
>>>>+ avoids segmentation of Jumbo Frames in the guest. Note that 'MTU'
>>>>refers
>>>>+ to the length of the IP packet only, and not that of the entire
>>>>frame.
>>>>+
>>>>+ e.g. To calculate the exact MTU of a standard IPv4 frame, subtract
>>>>the L2
>>>>+ header and CRC lengths (i.e. 18B) from the max supported frame
>>>>size.
>>>>+ So, to set the MTU for a 13312B Jumbo Frame:
>>>>+
>>>>+ ```
>>>>+ ifconfig eth1 mtu 13294
>>>>+ ```
>>>>+
>>>>+
>>>> Restrictions:
>>>> -------------
>>>>
>>>>- - Work with 1500 MTU, needs few changes in DPDK lib to fix this
>>>>issue.
>>>> - Currently DPDK port does not make use any offload functionality.
>>>> - DPDK-vHost support works with 1G huge pages.
>>>>
>>>>@@ -903,6 +954,12 @@ Restrictions:
>>>> the next release of DPDK (which includes the above patch) is
>>>>available and
>>>> integrated into OVS.
>>>>
>>>>+ Jumbo Frames:
>>>>+ - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The
>>>>source of
>>>>+ this issue is currently being investigated.
>>>>+ - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse
>>>>ports.
>>>>+
>>>>+
>>>> Bug Reporting:
>>>> --------------
>>>>
>>>>diff --git a/NEWS b/NEWS
>>>>index 5c18867..cd563e0 100644
>>>>--- a/NEWS
>>>>+++ b/NEWS
>>>>@@ -46,6 +46,8 @@ v2.5.0 - xx xxx xxxx
>>>> abstractions, such as virtual L2 and L3 overlays and security
>>>>groups.
>>>> - RHEL packaging:
>>>> * DPDK ports may now be created via network scripts (see
>>>>README.RHEL).
>>>>+ - netdev-dpdk:
>>>>+ * Add Jumbo Frame Support
>>>
>>>I don't think we should add this to 2.5
>>
>>Out of curiosity, is this mainly due to potential instability?
>>In any event, I'll move it into the 'Post-v2.5.0' section.
>>
>>>
>>>>
>>>>
>>>> v2.4.0 - 20 Aug 2015
>>>>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>index de7e488..76a5dcc 100644
>>>>--- a/lib/netdev-dpdk.c
>>>>+++ b/lib/netdev-dpdk.c
>>>>@@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>>VLOG_RATE_LIMIT_INIT(5, 20);
>>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>> #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>
>>>>+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1
>>>
>>>I don't see particular value in the above #define.
>>
>>I've removed this macro, and the section of code that used it in the
>>latest version of the patch.
>>
>>>
>>>>+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024
>>>>+
>>>> /*
>>>> * need to reserve tons of extra space in the mbufs so we can align the
>>>> * DMA addresses to 4KB.
>>>> * The minimum mbuf size is limited to avoid scatter behaviour and drop
>>>>in
>>>> * performance for standard Ethernet MTU.
>>>> */
>>>>-#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>-#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \
>>>>- + sizeof(struct dp_packet) \
>>>>- + RTE_PKTMBUF_HEADROOM)
>>>>-#define MBUF_SIZE_DRIVER (2048 \
>>>>- + sizeof (struct rte_mbuf) \
>>>>- + RTE_PKTMBUF_HEADROOM)
>>>>-#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>+#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>>>>ETHER_CRC_LEN)
>>>>+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN -
>>>>ETHER_CRC_LEN)
>>>>+#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \
>>>>+ + sizeof(struct dp_packet) \
>>>>+ + RTE_PKTMBUF_HEADROOM)
>>>>+
>>>>+/* This value should be specified as a multiple of the DPDK NIC
>>>>driver's
>>>>+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>+ */
>>>>+#define NETDEV_DPDK_MAX_FRAME_LEN 13312
>>>>
>>>> /* Max and min number of packets in the mempool. OVS tries to
>>>>allocate a
>>>> * mempool with MAX_NB_MBUF: if this fails (because the system doesn't
>>>>have
>>>>@@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>>VLOG_RATE_LIMIT_INIT(5, 20);
>>>> #define MIN_NB_MBUF (4096 * 4)
>>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>
>>>>+#define DPDK_VLAN_TAG_LEN 4
>>>>+
>>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>>ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF)
>>>>== 0);
>>>>
>>>>@@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>> .header_split = 0, /* Header Split disabled */
>>>> .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>> .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>- .jumbo_frame = 0, /* Jumbo Frame Support disabled */
>>>> .hw_strip_crc = 0,
>>>> },
>>>> .rx_adv_conf = {
>>>>@@ -254,6 +260,41 @@ is_dpdk_class(const struct netdev_class *class)
>>>> return class->construct == netdev_dpdk_construct;
>>>> }
>>>>
>>>>+/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for
>>>>igb_uio).
>>>>+ * If 'frame_len' is not a multiple of this value, insufficient buffers
>>>>are
>>>>+ * allocated to accomodate the packet in its entirety. Furthermore, the
>>>>igb_uio
>>>>+ * driver needs to ensure that there is also sufficient space in the Rx
>>>>buffer
>>>>+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is
>>>>too
>>>>+ * small, then the driver enables scatter RX behaviour, which reduces
>>>>+ * performance. To prevent this, use a buffer size that is closest to
>>>>+ * 'frame_len', but which satisfies the aforementioned criteria.
>>>>+ */
>>>>+static uint32_t
>>>>+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>+{
>>>>+ struct rte_eth_dev_info info;
>>>>+ uint32_t buf_size;
>>>>+ /* XXX: This is a workaround for DPDK v2.2, and needs to be
>>>>refactored with a
>>>>+ * future DPDK release. */
>>>
>>>Could you elaborate on that?
>>
>>Due to changes pending in the latest version of the code, this is no
>>longer relevant and will be removed.
>>
>>>
>>>>+ uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>+
>>>>+ if(netdev->type == DPDK_DEV_ETH) {
>>>>+ rte_eth_dev_info_get(netdev->port_id, &info);
>>>>+ buf_size = (info.min_rx_bufsize == 0) ?
>>>>+ NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>+ info.min_rx_bufsize;
>>>>+ } else {
>>>>+ buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>+ }
>>>>+
>>>>+ if(len % buf_size) {
>>>>+ len = buf_size * ((len/buf_size) + 1);
>>>>+ }
>>>
>>>I think this looks better with the ROUND_UP macro.
>>
>>True - I'll update the code accordingly.
>>
>>>
>>>>+
>>>>+ return len;
>>>>+}
>>>>+
>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be
>>>>used
>>>> * for all other segments data, bss and text. */
>>>>
>>>>@@ -280,26 +321,65 @@ free_dpdk_buf(struct dp_packet *p)
>>>> }
>>>>
>>>> static void
>>>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>- void *opaque_arg OVS_UNUSED,
>>>>- void *_m,
>>>>- unsigned i OVS_UNUSED)
>>>>+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>> {
>>>>- struct rte_mbuf *m = _m;
>>>>- uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>+ struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>+ struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>+ uint16_t roomsz;
>>>>
>>>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>
>>>>+ /* if no structure is provided, assume no mbuf private area */
>>>>+
>>>>+ user_mbp_priv = opaque_arg;
>>>>+ if (user_mbp_priv == NULL) {
>>>>+ default_mbp_priv.mbuf_priv_size = 0;
>>>>+ if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>+ roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>+ } else {
>>>>+ roomsz = 0;
>>>>+ }
>>>>+ default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>+ user_mbp_priv = &default_mbp_priv;
>>>>+ }
>>>>+
>>>>+ RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>+ user_mbp_priv->mbuf_data_room_size +
>>>>+ user_mbp_priv->mbuf_priv_size);
>>>>+
>>>>+ mbp_priv = rte_mempool_get_priv(mp);
>>>>+ memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>+}
>>>>+
>>>>+/* Initialise some fields in the mbuf structure that are not modified
>>>>by
>>>>the
>>>>+ * user once created (origin pool, buffer start address, etc.*/
>>>>+static void
>>>>+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>+ void *opaque_arg OVS_UNUSED,
>>>>+ void *_m,
>>>>+ unsigned i OVS_UNUSED)
>>>>+{
>>>>+ struct rte_mbuf *m = _m;
>>>>+ uint32_t buf_size, buf_len, priv_size;
>>>>+
>>>>+ priv_size = rte_pktmbuf_priv_size(mp);
>>>>+ buf_size = sizeof(struct dp_packet) + priv_size;
>>>>+ buf_len = rte_pktmbuf_data_room_size(mp);
>>>>+
>>>>+ RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>>>>priv_size);
>>>>+ RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>+ RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>+
>>>> memset(m, 0, mp->elt_size);
>>>>
>>>>- /* start of buffer is just after mbuf structure */
>>>>- m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>- m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>- sizeof(struct dp_packet);
>>>>+ /* start of buffer is after dp_packet structure and priv data */
>>>>+ m->priv_size = priv_size;
>>>>+ m->buf_addr = (char *)m + buf_size;
>>>>+ m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>> m->buf_len = (uint16_t)buf_len;
>>>>
>>>> /* keep some headroom between start of buffer and data */
>>>>- m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>+ m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>>>
>>>> /* init some constant fields */
>>>> m->pool = mp;
>>>>@@ -315,7 +395,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>> {
>>>> struct rte_mbuf *m = _m;
>>>>
>>>>- __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>+ __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>
>>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>> }
>>>>@@ -326,6 +406,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>OVS_REQUIRES(dpdk_mutex)
>>>> struct dpdk_mp *dmp = NULL;
>>>> char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>> unsigned mp_size;
>>>>+ struct rte_pktmbuf_pool_private mbp_priv;
>>>>
>>>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>@@ -338,6 +419,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>OVS_REQUIRES(dpdk_mutex)
>>>> dmp->socket_id = socket_id;
>>>> dmp->mtu = mtu;
>>>> dmp->refcount = 1;
>>>>+ mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>>RTE_PKTMBUF_HEADROOM;
>>>>+ mbp_priv.mbuf_priv_size = 0;
>>>>
>>>> mp_size = MAX_NB_MBUF;
>>>> do {
>>>>@@ -346,10 +429,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>OVS_REQUIRES(dpdk_mutex)
>>>> return NULL;
>>>> }
>>>>
>>>>- dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>+ dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>>MBUF_SEGMENT_SIZE(mtu),
>>>> MP_CACHE_SZ,
>>>> sizeof(struct
>>>>rte_pktmbuf_pool_private),
>>>>- rte_pktmbuf_pool_init, NULL,
>>>>+ ovs_rte_pktmbuf_pool_init,
>>>>&mbp_priv,
>>>> ovs_rte_pktmbuf_init, NULL,
>>>> socket_id, 0);
>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>>MIN_NB_MBUF);
>>>
>>>Ok, this is quite intricated. I believe the reason that OVS used its own
>>>__rte_pktmbuf_init() was that it wasn't possible to have custom metadata
>>>in mbufs before DPDK 2.1.
>>>
>>>If I'm not mistaken, with DPDK commit b507905ff407 there's a way to do
>>>that
>>>without copying any code from DPDK with the following incremental (from
>>>master)
>>>
>>>---8<---
>>>
>>>@@ -278,42 +272,12 @@ free_dpdk_buf(struct dp_packet *p)
>>> }
>>>
>>> static void
>>>-__rte_pktmbuf_init(struct rte_mempool *mp,
>>>- void *opaque_arg OVS_UNUSED,
>>>- void *_m,
>>>- unsigned i OVS_UNUSED)
>>>-{
>>>- struct rte_mbuf *m = _m;
>>>- uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>-
>>>- RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>-
>>>- memset(m, 0, mp->elt_size);
>>>-
>>>- /* start of buffer is just after mbuf structure */
>>>- m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>- m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>- sizeof(struct dp_packet);
>>>- m->buf_len = (uint16_t)buf_len;
>>>-
>>>- /* keep some headroom between start of buffer and data */
>>>- m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>-
>>>- /* init some constant fields */
>>>- m->pool = mp;
>>>- m->nb_segs = 1;
>>>- m->port = 0xff;
>>>-}
>>>-
>>>-static void
>>>-ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>- void *opaque_arg OVS_UNUSED,
>>>- void *_m,
>>>- unsigned i OVS_UNUSED)
>>>+ovs_rte_pktmbuf_init(struct rte_mempool *mp, void *opaque_arg,
>>>+ void *_m, unsigned i)
>>> {
>>> struct rte_mbuf *m = _m;
>>>
>>>- __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>+ rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>
>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>> }
>>>@@ -339,15 +303,21 @@ dpdk_mp_get(int socket_id, int mtu)
>>>OVS_REQUIRES(dpdk_mutex)
>>>
>>> mp_size = MAX_NB_MBUF;
>>> do {
>>>+ struct rte_pktmbuf_pool_private mbuf_sizes;
>>>+
>>> if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>>> dmp->mtu, dmp->socket_id, mp_size) < 0) {
>>> return NULL;
>>> }
>>>
>>>+ mbuf_sizes.mbuf_priv_size = sizeof (struct dp_packet)
>>>+ - sizeof (struct rte_mbuf);
>>>+ mbuf_sizes.mbuf_data_room_size = MBUF_SIZE(mtu)
>>>+ - sizeof(struct dp_packet);
>>> dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>> MP_CACHE_SZ,
>>> sizeof(struct
>>>rte_pktmbuf_pool_private),
>>>- rte_pktmbuf_pool_init, NULL,
>>>+ rte_pktmbuf_pool_init, &mbuf_sizes,
>>> ovs_rte_pktmbuf_init, NULL,
>>> socket_id, 0);
>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>MIN_NB_MBUF);
>>>
>>>---8<---
>>>
>>>I think this will make the patch cleaner. Do you think this will work
>>>with
>>>the increased MTU as well?
>>
>>Thanks for this Daniele. When I implemented the code, I attempted to
>>change as little of the legacy code as possible, but seeing as how I'm
>>now doing a patchset, I can roll this change in.
>>Btw, I just tested with P2P 9k frames, and it works fine :)
>>
>>>
>>>>@@ -433,6 +516,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>>>int
>>>>n_rxq, int n_txq)
>>>> {
>>>> int diag = 0;
>>>> int i;
>>>>+ struct rte_eth_conf conf = port_conf;
>>>>
>>>> /* A device may report more queues than it makes available (this
>>>>has
>>>> * been observed for Intel xl710, which reserves some of them for
>>>>@@ -444,7 +528,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev,
>>>>int n_rxq, int n_txq)
>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq,
>>>>n_txq);
>>>> }
>>>>
>>>>- diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>>&port_conf);
>>>>+ if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>+ conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>+ conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+ }
>>>>+
>>>>+ diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>>&conf);
>>>> if (diag) {
>>>> break;
>>>> }
>>>>@@ -586,6 +675,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>>>int
>>>>port_no,
>>>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>> int sid;
>>>> int err = 0;
>>>>+ uint32_t buf_size;
>>>>
>>>> ovs_mutex_init(&netdev->mutex);
>>>> ovs_mutex_lock(&netdev->mutex);
>>>>@@ -605,10 +695,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned
>>>>int port_no,
>>>> netdev->port_id = port_no;
>>>> netdev->type = type;
>>>> netdev->flags = 0;
>>>>+
>>>>+ /* Initialize port's MTU and frame len to the default Ethernet
>>>>values.
>>>>+ * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>+ * netdev_dpdk_set_config.
>>>>+ */
>>>>+ netdev->max_packet_len = ETHER_MAX_LEN;
>>>> netdev->mtu = ETHER_MTU;
>>>>- netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>
>>>>- netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>+ buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>+ netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>>FRAME_LEN_TO_MTU(buf_size));
>>>> if (!netdev->dpdk_mp) {
>>>> err = ENOMEM;
>>>> goto unlock;
>>>>@@ -651,6 +747,27 @@ dpdk_dev_parse_name(const char dev_name[], const
>>>>char prefix[],
>>>> return 0;
>>>> }
>>>>
>>>>+static void
>>>>+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>+{
>>>>+ const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>+ char *end_ptr = NULL;
>>>>+ int local_mtu;
>>>>+
>>>>+ if(mtu_str) {
>>>>+ local_mtu = strtoul(mtu_str, &end_ptr, 0);
>>>>+ }
>>>>+ if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>+ local_mtu >
>>>>FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN) ||
>>>>+ *end_ptr != '\0') {
>>>>+ local_mtu = ETHER_MTU;
>>>>+ VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>>to
>>>>%d.\n",
>>>>+ local_mtu);
>>>>+ }
>>>>+
>>>>+ *mtu = local_mtu;
>>>>+}
>>>>+
>>>> static int
>>>> vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>> {
>>>>@@ -777,11 +894,77 @@ netdev_dpdk_get_config(const struct netdev
>>>>*netdev_, struct smap *args)
>>>> smap_add_format(args, "configured_rx_queues", "%d",
>>>>netdev_->n_rxq);
>>>> smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>> smap_add_format(args, "configured_tx_queues", "%d",
>>>>dev->real_n_txq);
>>>>+ smap_add_format(args, "mtu", "%d", dev->mtu);
>>>> ovs_mutex_unlock(&dev->mutex);
>>>>
>>>> return 0;
>>>> }
>>>>
>>>>+/* Set the mtu of DPDK_DEV_ETH ports */
>>>>+static int
>>>>+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>+{
>>>>+ struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>+ int old_mtu, err;
>>>>+ uint32_t buf_size;
>>>>+ int dpdk_mtu;
>>>>+ struct dpdk_mp *old_mp;
>>>>+ struct dpdk_mp *mp;
>>>>+
>>>>+ ovs_mutex_lock(&dpdk_mutex);
>>>>+ ovs_mutex_lock(&dev->mutex);
>>>>+ if (dev->mtu == mtu) {
>>>>+ err = 0;
>>>>+ goto out;
>>>>+ }
>>>>+
>>>>+ buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>+ dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>+
>>>>+ mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>+ if (!mp) {
>>>>+ err = ENOMEM;
>>>>+ goto out;
>>>>+ }
>>>>+
>>>>+ rte_eth_dev_stop(dev->port_id);
>>>>+
>>>>+ old_mtu = dev->mtu;
>>>>+ old_mp = dev->dpdk_mp;
>>>>+ dev->dpdk_mp = mp;
>>>>+ dev->mtu = mtu;
>>>>+ dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+
>>>>+ err = dpdk_eth_dev_init(dev);
>>>>+ if (err) {
>>>>+ VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>>known "
>>>>+ "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>+ dpdk_mp_put(mp);
>>>>+ dev->mtu = old_mtu;
>>>>+ dev->dpdk_mp = old_mp;
>>>>+ dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+ dpdk_eth_dev_init(dev);
>>>>+ goto out;
>>>>+ } else {
>>>>+ dpdk_mp_put(old_mp);
>>>>+ netdev_change_seq_changed(netdev);
>>>>+ }
>>>>+out:
>>>>+ ovs_mutex_unlock(&dev->mutex);
>>>>+ ovs_mutex_unlock(&dpdk_mutex);
>>>>+ return err;
>>>>+}
>>>>+
>>>>+static int
>>>>+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>+{
>>>>+ int mtu;
>>>>+
>>>>+ dpdk_dev_parse_mtu(args, &mtu);
>>>>+
>>>>+ return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>+}
>>>>+
>>>> static int
>>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>> {
>>>>@@ -1358,54 +1541,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev,
>>>>int *mtup)
>>>>
>>>> return 0;
>>>> }
>>>>-
>>>>-static int
>>>>-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>-{
>>>>- struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>- int old_mtu, err;
>>>>- struct dpdk_mp *old_mp;
>>>>- struct dpdk_mp *mp;
>>>>-
>>>>- ovs_mutex_lock(&dpdk_mutex);
>>>>- ovs_mutex_lock(&dev->mutex);
>>>>- if (dev->mtu == mtu) {
>>>>- err = 0;
>>>>- goto out;
>>>>- }
>>>>-
>>>>- mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>- if (!mp) {
>>>>- err = ENOMEM;
>>>>- goto out;
>>>>- }
>>>>-
>>>>- rte_eth_dev_stop(dev->port_id);
>>>>-
>>>>- old_mtu = dev->mtu;
>>>>- old_mp = dev->dpdk_mp;
>>>>- dev->dpdk_mp = mp;
>>>>- dev->mtu = mtu;
>>>>- dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>-
>>>>- err = dpdk_eth_dev_init(dev);
>>>>- if (err) {
>>>>- dpdk_mp_put(mp);
>>>>- dev->mtu = old_mtu;
>>>>- dev->dpdk_mp = old_mp;
>>>>- dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>- dpdk_eth_dev_init(dev);
>>>>- goto out;
>>>>- }
>>>>-
>>>>- dpdk_mp_put(old_mp);
>>>>- netdev_change_seq_changed(netdev);
>>>>-out:
>>>>- ovs_mutex_unlock(&dev->mutex);
>>>>- ovs_mutex_unlock(&dpdk_mutex);
>>>>- return err;
>>>>-}
>>>>-
>>>> static int
>>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>
>>>>@@ -1682,7 +1817,7 @@ netdev_dpdk_get_status(const struct netdev
>>>>*netdev_, struct smap *args)
>>>> smap_add_format(args, "numa_id", "%d",
>>>>rte_eth_dev_socket_id(dev->port_id));
>>>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>>>>dev_info.min_rx_bufsize);
>>>>- smap_add_format(args, "max_rx_pktlen", "%u",
>>>>dev_info.max_rx_pktlen);
>>>>+ smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>> smap_add_format(args, "max_rx_queues", "%u",
>>>>dev_info.max_rx_queues);
>>>> smap_add_format(args, "max_tx_queues", "%u",
>>>>dev_info.max_tx_queues);
>>>> smap_add_format(args, "max_mac_addrs", "%u",
>>>>dev_info.max_mac_addrs);
>>>>@@ -1904,6 +2039,51 @@ dpdk_vhost_user_class_init(void)
>>>> return 0;
>>>> }
>>>>
>>>>+/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>+static int
>>>>+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>+{
>>>>+ struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>+ int err = 0;
>>>>+ struct dpdk_mp *old_mp;
>>>>+ struct dpdk_mp *mp;
>>>>+
>>>>+ ovs_mutex_lock(&dpdk_mutex);
>>>>+ ovs_mutex_lock(&dev->mutex);
>>>>+ if (dev->mtu == mtu) {
>>>>+ err = 0;
>>>>+ goto out;
>>>>+ }
>>>>+
>>>>+ mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>+ if (!mp) {
>>>>+ err = ENOMEM;
>>>>+ goto out;
>>>>+ }
>>>>+
>>>>+ old_mp = dev->dpdk_mp;
>>>>+ dev->dpdk_mp = mp;
>>>>+ dev->mtu = mtu;
>>>>+ dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>+
>>>>+ dpdk_mp_put(old_mp);
>>>>+ netdev_change_seq_changed(netdev);
>>>>+out:
>>>>+ ovs_mutex_unlock(&dev->mutex);
>>>>+ ovs_mutex_unlock(&dpdk_mutex);
>>>>+ return err;
>>>>+}
>>>>+
>>>>+static int
>>>>+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap
>>>>*args)
>>>>+{
>>>>+ int mtu;
>>>>+
>>>>+ dpdk_dev_parse_mtu(args, &mtu);
>>>>+
>>>>+ return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>+}
>>>>+
>>>> static void
>>>> dpdk_common_init(void)
>>>> {
>>>>@@ -2040,8 +2220,9 @@ unlock_dpdk:
>>>> return err;
>>>> }
>>>>
>>>>-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ,
>>>>SEND,
>>>>\
>>>>- GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)
>>>>\
>>>>+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG,
>>>>\
>>>>+ MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,
>>>>\
>>>>+ GET_STATUS, RXQ_RECV)
>>>>\
>>>> { \
>>>> NAME, \
>>>> INIT, /* init */ \
>>>>@@ -2053,7 +2234,7 @@ unlock_dpdk:
>>>> DESTRUCT, \
>>>> netdev_dpdk_dealloc, \
>>>> netdev_dpdk_get_config, \
>>>>- NULL, /* netdev_dpdk_set_config */ \
>>>>+ SET_CONFIG, \
>>>> NULL, /* get_tunnel_config */ \
>>>> NULL, /* build header */ \
>>>> NULL, /* push header */ \
>>>>@@ -2067,7 +2248,7 @@ unlock_dpdk:
>>>> netdev_dpdk_set_etheraddr, \
>>>> netdev_dpdk_get_etheraddr, \
>>>> netdev_dpdk_get_mtu, \
>>>>- netdev_dpdk_set_mtu, \
>>>>+ SET_MTU, \
>>>> netdev_dpdk_get_ifindex, \
>>>> GET_CARRIER, \
>>>> netdev_dpdk_get_carrier_resets, \
>>>>@@ -2213,8 +2394,10 @@ static const struct netdev_class dpdk_class =
>>>> NULL,
>>>> netdev_dpdk_construct,
>>>> netdev_dpdk_destruct,
>>>>+ netdev_dpdk_set_config,
>>>> netdev_dpdk_set_multiq,
>>>> netdev_dpdk_eth_send,
>>>>+ netdev_dpdk_set_mtu,
>>>> netdev_dpdk_get_carrier,
>>>> netdev_dpdk_get_stats,
>>>> netdev_dpdk_get_features,
>>>>@@ -2227,8 +2410,10 @@ static const struct netdev_class dpdk_ring_class
>>>>=
>>>> NULL,
>>>> netdev_dpdk_ring_construct,
>>>> netdev_dpdk_destruct,
>>>>+ netdev_dpdk_set_config,
>>>> netdev_dpdk_set_multiq,
>>>> netdev_dpdk_ring_send,
>>>>+ netdev_dpdk_set_mtu,
>>>> netdev_dpdk_get_carrier,
>>>> netdev_dpdk_get_stats,
>>>> netdev_dpdk_get_features,
>>>>@@ -2241,8 +2426,10 @@ static const struct netdev_class OVS_UNUSED
>>>>dpdk_vhost_cuse_class =
>>>> dpdk_vhost_cuse_class_init,
>>>> netdev_dpdk_vhost_cuse_construct,
>>>> netdev_dpdk_vhost_destruct,
>>>>+ NULL,
>>>> netdev_dpdk_vhost_set_multiq,
>>>> netdev_dpdk_vhost_send,
>>>>+ NULL,
>>>> netdev_dpdk_vhost_get_carrier,
>>>> netdev_dpdk_vhost_get_stats,
>>>> NULL,
>>>>@@ -2255,8 +2442,10 @@ static const struct netdev_class OVS_UNUSED
>>>>dpdk_vhost_user_class =
>>>> dpdk_vhost_user_class_init,
>>>> netdev_dpdk_vhost_user_construct,
>>>> netdev_dpdk_vhost_destruct,
>>>>+ netdev_dpdk_vhost_set_config,
>>>> netdev_dpdk_vhost_set_multiq,
>>>> netdev_dpdk_vhost_send,
>>>>+ netdev_dpdk_vhost_set_mtu,
>>>> netdev_dpdk_vhost_get_carrier,
>>>> netdev_dpdk_vhost_get_stats,
>>>> NULL,
>>>>--
>>>>1.9.3
>>>>
>>>>_______________________________________________
>>>>dev mailing list
>>>>dev at openvswitch.org
>>>>http://openvswitch.org/mailman/listinfo/dev
>>
More information about the dev
mailing list