Eric Dumazet [Fri, 12 Feb 2016 00:28:50 +0000 (16:28 -0800)]
tcp/dccp: better use of ephemeral ports in bind()
Implement strategy used in __inet_hash_connect() in opposite way :
Try to find a candidate using odd ports, then fallback to even ports.
We no longer disable BH for whole traversal, but one bucket at a time.
We also use cond_resched() to yield cpu to other tasks if needed.
I removed one indentation level and tried to mirror the loop we have
in __inet_hash_connect() and variable names to ease code maintenance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 12 Feb 2016 00:28:49 +0000 (16:28 -0800)]
tcp/dccp: better use of ephemeral ports in connect()
In commit
07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()"), I added a very simple heuristic, so that we got better
chances to use even ports, and allow bind() users to have more available
slots.
It gave nice results, but with more than 200,000 TCP sessions on a typical
server, the ~30,000 ephemeral ports are still a rare resource.
I chose to go a step further, by looking at all even ports, and if none
was available, fallback to odd ports.
The companion patch does the same in bind(), but in opposite way.
I've seen exec times of up to 30ms on busy servers, so I no longer
disable BH for the whole traversal, but only for each hash bucket.
I also call cond_resched() to be gentle to other tasks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:59:17 +0000 (11:59 -0500)]
Merge branch 'net-mitigate-kmem_free-slowpath'
Jesper Dangaard Brouer says:
====================
net: mitigating kmem_cache free slowpath
This patchset is the first real use-case for kmem_cache bulk _free_.
The use of bulk _alloc_ is NOT included in this patchset. The full use
have previously been posted here [1].
The bulk free side have the largest benefit for the network stack
use-case, because network stack is hitting the kmem_cache/SLUB
slowpath when freeing SKBs, due to the amount of outstanding SKBs.
This is solved by using the new API kmem_cache_free_bulk().
Introduce new API napi_consume_skb(), that hides/handles bulk freeing
for the caller. The drivers simply need to use this call when freeing
SKBs in NAPI context, e.g. replacing their calles to dev_kfree_skb() /
dev_consume_skb_any().
Driver ixgbe is the first user of this new API.
[1] http://thread.gmane.org/gmane.linux.network/384302/focus=397373
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Mon, 8 Feb 2016 12:15:09 +0000 (13:15 +0100)]
ixgbe: bulk free SKBs during TX completion cleanup cycle
There is an opportunity to bulk free SKBs during reclaiming of
resources after DMA transmit completes in ixgbe_clean_tx_irq. Thus,
bulk freeing at this point does not introduce any added latency.
Simply use napi_consume_skb() which were recently introduced. The
napi_budget parameter is needed by napi_consume_skb() to detect if it
is called from netpoll.
Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
Single CPU/flow numbers: before:
1982144 pps -> after :
2064446 pps
Improvement: +82302 pps, -20 nanosec, +4.1%
(SLUB and GCC version 5.1.1
20150618 (Red Hat 5.1.1-4))
Joint work with Alexander Duyck.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Mon, 8 Feb 2016 12:15:04 +0000 (13:15 +0100)]
net: bulk free SKBs that were delay free'ed due to IRQ context
The network stack defers SKBs free, in-case free happens in IRQ or
when IRQs are disabled. This happens in __dev_kfree_skb_irq() that
writes SKBs that were free'ed during IRQ to the softirq completion
queue (softnet_data.completion_queue).
These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
in function net_tx_action(). Take advantage of this a use the skb
defer and flush API, as we are already in softirq context.
For modern drivers this rarely happens. Although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed. This due to netpoll can call from
IRQ context.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Mon, 8 Feb 2016 12:14:59 +0000 (13:14 +0100)]
net: bulk free infrastructure for NAPI context, use napi_consume_skb
Discovered that network stack were hitting the kmem_cache/SLUB
slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk
can speedup this slowpath.
NAPI context is a bit special, lets take advantage of that for bulk
free'ing SKBs.
In NAPI context we are running in softirq, which gives us certain
protection. A softirq can run on several CPUs at once. BUT the
important part is a softirq will never preempt another softirq running
on the same CPU. This gives us the opportunity to access per-cpu
variables in softirq context.
Extend napi_alloc_cache (before only contained page_frag_cache) to be
a struct with a small array based stack for holding SKBs. Introduce a
SKB defer and flush API for accessing this.
Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any()
when running in NAPI context. A small trick to handle/detect if we
are called from netpoll is to see if budget is 0. In that case, we
need to invoke dev_consume_skb_irq().
Joint work with Alexander Duyck.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:55:51 +0000 (11:55 -0500)]
Merge branch 'virtio_net-ethtool-validation'
Nikolay Aleksandrov says:
====================
virtio_net: better ethtool setting validation
This small set is a follow-up for the recent patches that added ethtool
get/set settings. Patch 1 changes the speed validation routine to check
if the speed is between 0 and INT_MAX (or SPEED_UNKNOWN) and patch 2 adds
port validation to virtio_net and better validation comment.
This set is on top of Michael's patch which explains that speeds from 0
to INT_MAX are valid:
http://patchwork.ozlabs.org/patch/578911/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Sun, 7 Feb 2016 20:52:24 +0000 (21:52 +0100)]
virtio_net: validate ethtool port setting and explain the user validation
We should validate the port setting that we got from the user and check
if it's what we've set it to (PORT_OTHER), also add explanation that
ignoring advertising is good as long as we don't have autonegotiation.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Sun, 7 Feb 2016 20:52:23 +0000 (21:52 +0100)]
ethtool: make validate_speed accept all speeds between 0 and INT_MAX
Devices these days can have any speed and as was recently pointed out
any speed from 0 to INT_MAX is valid so adjust speed validation to
accept such values.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:52:09 +0000 (11:52 -0500)]
Merge branch 'dp83848-TLK10x'
Andrew F. Davis says:
====================
net: phy: dp83848: Add support for TI TLK10x Ethernet PHYs
This series is [0] split into its logical components.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew F. Davis [Sun, 7 Feb 2016 17:47:21 +0000 (11:47 -0600)]
net: phy: dp83848: Add comments for register definitions
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew F. Davis [Sun, 7 Feb 2016 17:47:20 +0000 (11:47 -0600)]
net: phy: dp83848: Add support for TI TLK10x Ethernet PHYs
The TI TLK10x Ethernet PHYs are similar in the interrupt relevant
registers and so are compatible with the DP83848x devices already
supported.
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew F. Davis [Sun, 7 Feb 2016 17:47:19 +0000 (11:47 -0600)]
net: phy: dp83848: Reorganize code for readability and safety
Reorganize code by moving the desired interrupt mask definition
out of function. Also rearrange the enable/disable interrupt function
to prevent accidental over-writing of values in registers.
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew F. Davis [Sun, 7 Feb 2016 17:47:18 +0000 (11:47 -0600)]
net: phy: dp83848: Add PHY ID for TI version of DP83848C
After acquiring National Semiconductor, TI appears to have
changed the Vendor Model Number for the DP83848C PHYs,
add this new ID to supported IDs.
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew F. Davis [Sun, 7 Feb 2016 17:47:17 +0000 (11:47 -0600)]
net: phy: dp83848: Add macro for dp83848 compatible devices
Add a helper macro for defining dp83848 compatible phy devices.
Update copyright info.
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Feb 2016 11:42:26 +0000 (12:42 +0100)]
be2net: don't report EVB for older chipsets when SR-IOV is disabled
The EVB (virtual bridge) functionality should be disabled on older BE3
and Lancer chips if SR-IOV is disabled in the NIC's BIOS. This setting
is identified by the zero value of total VFs reported by the card.
The GET_HSW_CONFIG command cannot be used as it is not supported by
these older chipset's FW.
v2: added the comment
Cc: Sathya Perla <sathya.perla@broadcom.com>
Cc: Ajit Khaparde <ajit.khaparde@broadcom.com>
Cc: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Cc: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:34:31 +0000 (11:34 -0500)]
Merge branch 'spi_ks8995'
Helmut Buchsbaum says:
====================
Add support for MICREL KSZ8795CLX 5-port switch
This patch series refactors the spi-ks8995 driver to finally add support
for the MICREL KSZ8795CLX. Additionally support for controlling a GPIO
line for resetting the switch is added.
Helmut
Changes since v2:
- use GPIO_ACTIVE_LOW according to Andrew's remark.
- use ePAPR compliant node name in example, thanks to Sergei for
pointing out
Changes since v1:
- removed initializing registers from Device Tree following Florian's
advice
- fixed GPIO handling for reset according to Andrew's remark.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Buchsbaum [Tue, 9 Feb 2016 19:47:18 +0000 (20:47 +0100)]
dt-bindings: net: ks8995: add bindings documentation for ks8995
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Buchsbaum [Tue, 9 Feb 2016 19:47:17 +0000 (20:47 +0100)]
net: phy: spi_ks8995: add support for MICREL KSZ8795CLX
Add support for MICREL KSZ8795CLX Integrated 5-Port, 10-/100-Managed
Ethernet Switch with Gigabit GMII/RGMII and MII/RMII interfaces.
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Buchsbaum [Tue, 9 Feb 2016 19:47:16 +0000 (20:47 +0100)]
net: phy: spi_ks8995: generalize creation of SPI commands
Prepare creating SPI reads and writes for other switch families.
The KS8995 family uses the straight forward
<8bit CMD><8bit ADDR>
sequence.
To be able to support KSZ8795 family, which uses
<3bit CMD><12bit ADDR><1 bit TR>
make the SPI command creation chip variant dependent.
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Buchsbaum [Tue, 9 Feb 2016 19:47:15 +0000 (20:47 +0100)]
net: phy: spi_ks8995: add support for resetting switch using GPIO
When using device tree it is no more possible to reset the PHY at board
level. Furthermore, doing in the driver allows to power down the switch
when it is not used any more.
The patch introduces a new optional property "reset-gpios" denoting an
appropriate GPIO handle, e.g.:
reset-gpios = <&gpio0 46 GPIO_ACTIVE_LOW>
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Buchsbaum [Tue, 9 Feb 2016 19:47:14 +0000 (20:47 +0100)]
net: phy: spi_ks8995: verify chip and determine revision
Since the chip variant is now determined by spi_device_id, verify
family and chip id and determine the revision id.
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Buchsbaum [Tue, 9 Feb 2016 19:47:13 +0000 (20:47 +0100)]
net: phy: spi_ks8995: introduce spi_device_id table
Refactor to use spi_device_id table to facilitate easy
extendability.
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:30:32 +0000 (11:30 -0500)]
Merge branch 'thunderx-irq-hints'
Sunil Goutham says:
====================
net: thunderx: Setting IRQ affinity hints and other optimizations
This patch series contains changes
- To add support for virtual function's irq affinity hint
- Replace napi_schedule() with napi_schedule_irqoff()
- Reduce page allocation overhead by allocating pages
of higher order when pagesize is 4KB.
- Add couple of stats which helps in debugging
- Some miscellaneous changes to BGX driver.
Changes from v1:
- As suggested changed MAC address invalid log message
to dev_err() instead of dev_warn().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Thu, 11 Feb 2016 16:20:26 +0000 (21:50 +0530)]
net: thunderx: Alloc higher order pages when pagesize is small
Allocate higher order pages when pagesize is small, this will
reduce number of calls to page allocator and wastage of memory.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Richter [Thu, 11 Feb 2016 16:20:25 +0000 (21:50 +0530)]
net: thunderx: bgx: Add log message when setting mac address
Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Daney [Thu, 11 Feb 2016 16:20:24 +0000 (21:50 +0530)]
net: thunderx: bgx: Use standard firmware node infrastructure.
In the case of OF device tree, the firmware information is attached to
the BGX device structure in the standard manner, so use the firmware
iterators and accessors where possible.
Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Thu, 11 Feb 2016 16:20:23 +0000 (21:50 +0530)]
net: thunderx: Assign affinity hints to vf's interrupts
This affinity hint can be used by user space irqbalance tool to set
preferred CPU mask for irqs registered by this VF. Irqbalance needs
to be in 'exact' mode to set irq affinity same as indicated by
affinity hint.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Thu, 11 Feb 2016 16:20:22 +0000 (21:50 +0530)]
net: thunderx: Use napi_schedule_irqoff()
napi_schedule is being called from hard irq context, hence
switch to napi_schedule_irqoff which avoids unneeded call
to local_irq_save and local_irq_restore.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanneeru Srinivasulu [Thu, 11 Feb 2016 16:20:21 +0000 (21:50 +0530)]
net, thunderx: Add TX timeout and RX buffer alloc failure stats.
When system is low on atomic memory, too many error messages are logged.
Since this is not a total failure but a simple switch to non-atomic allocation
better to have a stat.
Also add a stat for reset, kicked due to transmit watchdog timeout.
Signed-off-by: Thanneeru Srinivasulu <tsrinivasulu@caviumnetworks.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 14:59:28 +0000 (09:59 -0500)]
Merge branch 'igmp-ns'
Nikolay Borisov says:
====================
Make igmp sysctl knobs namespace aware
This series continue making more of the net related sysctls
namespace aware. The first 2 and last patches are straight
forward and convert sysctls which weren't defined to be
namespace aware. The only thing in them is that each removes
a define which is used in only one place (to initialise
the respective sysctl) so I don't think this is a huge loss.
The third patch however, converts igmp_llm_reports which was
already defined in the ipv4_net_table but wasn't using any of
the net namespace infrastructure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Borisov [Mon, 8 Feb 2016 21:29:24 +0000 (23:29 +0200)]
igmp: Namespacify igmp_qrv sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Borisov [Mon, 8 Feb 2016 22:13:50 +0000 (00:13 +0200)]
igmp: Namespaceify igmp_llm_reports sysctl knob
This was initially introduced in
df2cf4a78e488d26 ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array, however it was never implemented to be
namespace aware. Fix this by changing the code accordingly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Borisov [Mon, 8 Feb 2016 21:29:22 +0000 (23:29 +0200)]
igmp: Namespaceify igmp_max_msf sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Borisov [Mon, 8 Feb 2016 21:29:21 +0000 (23:29 +0200)]
igmp: Namespaceify igmp_max_memberships sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Tue, 9 Feb 2016 10:37:46 +0000 (10:37 +0000)]
bonding: use return instead of goto
Replace 'goto' with 'return' to remove unnecessary check at label:
err_undo_flags.
The reason is that 'err_undo_flags' do two things for the first slave device:
1.revert bond mac address if it is set by the slave device.
2.revert bond device type if it's not ARPHRD_ETHER.
It's not necessary for the following three places, they changed neither bond
mac address nor type. It's straightforward to return directly.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergio Prado [Tue, 9 Feb 2016 14:07:16 +0000 (12:07 -0200)]
net: macb: add wake-on-lan support via magic packet
Tested on Acqua A5 SoM (http://www.acmesystems.it/acqua).
Signed-off-by: Sergio Prado <sergio.prado@e-labworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amitoj Kaur Chawla [Wed, 10 Feb 2016 04:38:54 +0000 (10:08 +0530)]
net: hamradio: baycom_ser_fdx: Replace timeval with timespec64
32 bit systems using 'struct timeval' will break in the year 2038, so
we replace the code appropriately. However, this driver is not broken
in 2038 since we are only using microseconds portion of the time.
This patch replaces 'struct timeval' with 'struct timespec64'. We only
need to find elapsed microseconds rather than absolute time, so it's
better to use monotonic time, so using ktime_get_ts64() makes the code
more efficient and more robust against concurrent settimeofday()
calls.
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Thomas Sailer <t.sailer@alumni.ethz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tycho Andersen [Fri, 5 Feb 2016 16:20:52 +0000 (09:20 -0700)]
openvswitch: allow management from inside user namespaces
Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.
Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
which should be allowed inside a user namespace.
The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.
v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
massive one
Reported-by: James Page <james.page@canonical.com>
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Pravin Shelar <pshelar@ovn.org>
CC: Justin Pettit <jpettit@nicira.com>
CC: "David S. Miller" <davem@davemloft.net>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Sun, 7 Feb 2016 21:27:55 +0000 (23:27 +0200)]
ethtool: future-proof interface for speed extensions
Many virtual and not quite virtual devices allow any speed to be set
through ethtool. In particular, this applies to the virtio-net devices.
Document this fact to make sure people don't assume the enum lists all
possible values. Reserve values greater than INT_MAX for future
extension and to avoid conflict with SPEED_UNKNOWN.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Wed, 10 Feb 2016 06:11:27 +0000 (22:11 -0800)]
vrf: duplicate include of rtnetlink.h
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Wed, 10 Feb 2016 06:07:29 +0000 (22:07 -0800)]
vxlan: udp_tunnel duplicate include net/udp_tunnel.h
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Wed, 10 Feb 2016 06:04:47 +0000 (22:04 -0800)]
rds: duplicate include net/tcp.h
Duplicate include detected.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amitoj Kaur Chawla [Sun, 7 Feb 2016 05:26:25 +0000 (10:56 +0530)]
bonding: Return correct error code
The return value of kzalloc on failure of allocation of memory should
be -ENOMEM and not -1.
Found using Coccinelle. A simplified version of the semantic patch
used is:
//<smpl>
@@
expression *e;
@@
e = kzalloc(...);
if (e == NULL) {
...
return
- -1
+ -ENOMEM
;
}
//</smpl>
The single call site only checks that the return value is not 0,
hence no change is required at the call site.
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 13:55:42 +0000 (08:55 -0500)]
Merge branch 'gso-checksums'
Alexander Duyck says:
====================
Add GSO support for outer checksum w/ inner checksum offloads
This patch series updates the existing segmentation offload code for
tunnels to make better use of existing and updated GSO checksum
computation. This is done primarily through two mechanisms. First we
maintain a separate checksum in the GSO context block of the sk_buff. This
allows us to maintain two checksum values, one offloaded with values stored
in csum_start and csum_offset, and one computed and tracked in
SKB_GSO_CB(skb)->csum. By maintaining these two values we are able to take
advantage of the same sort of math used in local checksum offload so that
we can provide both inner and outer checksums with minimal overhead.
Below is the performance for a netperf session between an ixgbe PF and VF
on the same host but in different namespaces. As can be seen a significant
gain in performance can be had from allowing the use of Tx checksum offload
on the inner headers while performing a software offload on the outer
header computation:
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
Before:
87380 16384 16384 10.00 12844.38 9.30 -1.00 0.712 -1.00
After:
87380 16384 16384 10.00 13216.63 6.78 -1.00 0.504 -1.000
Changes from v1:
* Dropped use of CHECKSUM_UNNECESSARY for remote checksum offload
* Left encap_hdr_csum as it will likely be needed in future for SCTP GSO
* Broke the changes out over many more patches
* Updated GRE segmentation to more closely match UDP tunnel segmentation
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:26 +0000 (15:28 -0800)]
net: Allow tunnels to use inner checksum offloads with outer checksums needed
This patch enables us to use inner checksum offloads if provided by
hardware with outer checksums computed by software.
It basically reduces encap_hdr_csum to an advisory flag for now, but based
on the fact that SCTP may be getting segmentation support before long I
thought we may want to keep it as it is possible we may need to support
CRC32c and 1's compliment checksum in the same packet at some point in the
future.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:20 +0000 (15:28 -0800)]
udp: Use uh->len instead of skb->len to compute checksum in segmentation
The segmentation code was having to do a bunch of work to pull the
skb->len and strip the udp header offset before the value could be used to
adjust the checksum. Instead of doing all this work we can just use the
value that goes into uh->len since that is the correct value with the
correct byte order that we need anyway. By using this value we can save
ourselves a bunch of pain as there is no need to do multiple byte swaps.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:14 +0000 (15:28 -0800)]
udp: Clean up the use of flags in UDP segmentation offload
This patch goes though and cleans up the logic related to several of the
control flags used in UDP segmentation. Specifically the use of dont_encap
isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and
if it isn't set then we don't need to update the internal headers. As such
we can just drop that value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:08 +0000 (15:28 -0800)]
gre: Use inner_proto to obtain inner header protocol
Instead of parsing headers to determine the inner protocol we can just pull
the value from inner_proto.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:01 +0000 (15:28 -0800)]
gre: Use GSO flags to determine csum need instead of GRE flags
This patch updates the gre checksum path to follow something much closer to
the UDP checksum path. By doing this we can avoid needing to do as much
header inspection and can just make use of the fields we were already
reading in the sk_buff structure.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:55 +0000 (15:27 -0800)]
net: Move skb_has_shared_frag check out of GRE code and into segmentation
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help
to verify that no frags can be modified by an external entity. This check
really doesn't belong in the GRE path but in the skb_segment function
itself. This way any protocol that might be segmented will be performing
this check before attempting to offload a checksum to software.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:49 +0000 (15:27 -0800)]
net: Store checksum result for offloaded GSO checksums
This patch makes it so that we can offload the checksums for a packet up
to a certain point and then begin computing the checksums via software.
Setting this up is fairly straight forward as all we need to do is reset
the values stored in csum and csum_start for the GSO context block.
One complication for this is remote checksum offload. In order to allow
the inner checksums to be offloaded while computing the outer checksum
manually we needed to have some way of indicating that the offload wasn't
real. In order to do that I replaced CHECKSUM_PARTIAL with
CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer
header while skipping computing checksums for the inner headers. We clean
up the ip_summed flag and set it to either CHECKSUM_PARTIAL or
CHECKSUM_NONE once we hand the packet off to the next lower level.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:43 +0000 (15:27 -0800)]
net: Update remote checksum segmentation to support use of GSO checksum
This patch addresses two main issues.
First in the case of remote checksum offload we were avoiding dealing with
scatter-gather issues. As a result it would be possible to assemble a
series of frames that used frags instead of being linearized as they should
have if remote checksum offload was enabled.
Second I have updated the code so that we now let GSO take care of doing
the checksum on the data itself and drop the special case that was added
for remote checksum offload.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:37 +0000 (15:27 -0800)]
net: Move GSO csum into SKB_GSO_CB
This patch moves the checksum maintained by GSO out of skb->csum and into
the GSO context block in order to allow for us to work on outer checksums
while maintaining the inner checksum offsets in the case of the inner
checksum being offloaded, while the outer checksums will be computed.
While updating the code I also did a minor cleanu-up on gso_make_checksum.
The change is mostly to make it so that we store the values and compute the
checksum instead of computing the checksum and then storing the values we
needed to update.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:31 +0000 (15:27 -0800)]
net: Drop unecessary enc_features variable from tunnel segmentation functions
The enc_features variable isn't necessary since features isn't used
anywhere after we create enc_features so instead just use a destructive AND
on features itself and save ourselves the variable declaration.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sixiao@microsoft.com [Thu, 4 Feb 2016 23:49:34 +0000 (15:49 -0800)]
hv_netvsc: cleanup netdev feature flags for netvsc
1. Adding NETIF_F_TSO6 feature flag;
2. Adding NETIF_F_HW_CSUM. NETIF_F_IPV6_CSUM and NETIF_F_IP_CSUM are
being deprecated;
3. Cleanup the coding style of flag assignment by using macro.
Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 12:16:29 +0000 (07:16 -0500)]
Merge branch 'ethtool-nfc-ipv6'
Edward Cree says:
====================
IPv6 NFC
This series adds support for steering IPv6 flows using the ethtool NFC
interface, and implements it for sfc devices.
Tested using an in-development patch to the ethtool utility.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 5 Feb 2016 11:16:50 +0000 (11:16 +0000)]
sfc: implement IPv6 NFC (and IPV4_USER_FLOW)
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 5 Feb 2016 11:16:21 +0000 (11:16 +0000)]
ethtool: add IPv6 to the NFC API
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 12:13:29 +0000 (07:13 -0500)]
Merge branch 'cxgb4-tos'
Hariprasad Shenai says:
====================
Add TOS support and some cleanup
This series adds TOS support for iWARP and also does some cleanup to make
code more readable. Patch series is created against infiniband tree and
includes patches on iw_cxgb4 and cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Fri, 5 Feb 2016 06:13:30 +0000 (11:43 +0530)]
cxgb4/iw_cxgb4: TOS support
This series provides support for iWARP applications to specify a TOS
value and have that map to a VLAN Priority for iw_cxgb4 iWARP connections.
In iw_cxgb4, when allocating an L2T entry, pass the skb_priority based
on the tos value in the cm_id. Also pass the correct tos value during
connection setup so the passive side gets the client's desired tos.
When sending the FLOWC work request to FW, if the egress device is
in a vlan, then use the vlan priority bits as the scheduling class.
This allows associating RDMA connections with scheduling classes to
provide traffic shaping per flow.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Fri, 5 Feb 2016 06:13:29 +0000 (11:43 +0530)]
iw_cxgb4: remove false error log entry
Don't log errors if a listening endpoint is going away when procesing a
PASS_ACCEPT_REQ message. This can happen. Change the error printk to
a PDBG() debug log entry
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Fri, 5 Feb 2016 06:13:28 +0000 (11:43 +0530)]
iw_cxgb4: make queue allocation code more readable
Rename local mm* variables to more meaningful names
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 11:15:00 +0000 (06:15 -0500)]
Merge branch 'fec-next'
Troy Kisky says:
====================
net: fec: cleanup/fixes
V2 is a rebase on top of johannes endian-safe patch and
is only the 1st eight patches.
The testing for this series was done on a nitrogen6x.
The base commit was
commit
b45efa30a626e915192a6c548cd8642379cd47cc
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Testing showed no change in performance.
Testing used imx_v6_v7_defconfig + CONFIG_MICREL_PHY.
The processor was running at 996Mhz.
The following commands were used to get the transfer rates.
On an x86 ubunto system,
iperf -s -i.5 -u
On a nitrogen6x board, running via SD Card.
I first stopped some background processes
stop cron
stop upstart-file-bridge
stop upstart-socket-bridge
stop upstart-udev-bridge
stop rsyslog
stop dbus
killall dhclient
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
taskset 0x2 iperf -c 192.168.0.201 -u -t60 -b500M -r
There is a branch available on github with this series, and the rest of
my fec patches, for those who would like to test it.
https://github.com:boundarydevices/linux-imx6.git branch net-next_master
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:50 +0000 (14:52 -0700)]
net: fec: improve error handling
Unmap initial buffer on error.
Don't free skb until it has been unmapped.
Move cbd_bufaddr assignment closer to the mapping function.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:49 +0000 (14:52 -0700)]
net: fec: don't transfer ownership until descriptor write is complete
If you don't own it, you shouldn't write to it.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:48 +0000 (14:52 -0700)]
net: fec: don't disable FEC_ENET_TS_TIMER interrupt
Only the interrupt routine processes this condition.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:47 +0000 (14:52 -0700)]
net: fec: add variable reg_desc_active to speed things up
There is no need for complex macros every time we need to activate
a queue. Also, no need to call skb_get_queue_mapping when we already
know which queue it is using.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:46 +0000 (14:52 -0700)]
net: fec: add struct bufdesc_prop
This reduces code and gains speed.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:45 +0000 (14:52 -0700)]
net: fec: fix fec_enet_get_free_txdesc_num
When first initialized, cur_tx points to the 1st
entry in the queue, and dirty_tx points to the last.
At this point, fec_enet_get_free_txdesc_num will
return tx_ring_size -2. If tx_ring_size -2 entries
are now queued, then fec_enet_get_free_txdesc_num
should return 0, but it returns tx_ring_size instead.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:44 +0000 (14:52 -0700)]
net: fec: fix rx error counts
On an overrun, the other flags are not
valid, so don't check them.
Also, don't pass bad frames up the stack.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:43 +0000 (14:52 -0700)]
net: fec: stop the "rcv is not +last, " error messages
Setting the FTRL register will stop the fec from
trying to use multiple receive buffers.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 4 Feb 2016 16:42:28 +0000 (17:42 +0100)]
bonding: 3ad: allow to set ad_actor settings while the bond is up
No need to require the bond down while changing these settings, the change
will be reflected immediately and the 3ad mode will sort itself out.
For faster convergence set port->ntt to true in order to generate new
LACPDUs immediately.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:20 +0000 (13:31 +0100)]
ipv6: add option to drop unsolicited neighbor advertisements
In certain 802.11 wireless deployments, there will be NA proxies
that use knowledge of the network to correctly answer requests.
To prevent unsolicitd advertisements on the shared medium from
being a problem, on such deployments wireless needs to drop them.
Enable this by providing an option called "drop_unsolicited_na".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:19 +0000 (13:31 +0100)]
ipv6: add option to drop unicast encapsulated in L2 multicast
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv6 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:18 +0000 (13:31 +0100)]
ipv4: add option to drop gratuitous ARP packets
In certain 802.11 wireless deployments, there will be ARP proxies
that use knowledge of the network to correctly answer requests.
To prevent gratuitous ARP frames on the shared medium from being
a problem, on such deployments wireless needs to drop them.
Enable this by providing an option called "drop_gratuitous_arp".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:17 +0000 (13:31 +0100)]
ipv4: add option to drop unicast encapsulated in L2 multicast
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv4 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.
Additionally, enabling this option provides compliance with a SHOULD
clause of RFC 1122.
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Siva Reddy Kallam [Thu, 4 Feb 2016 09:50:47 +0000 (15:20 +0530)]
MAINTAINERS: Update tg3 maintainer
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Acked-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Tang [Thu, 4 Feb 2016 09:36:23 +0000 (17:36 +0800)]
bpf_dbg: do not initialise statics to 0
This patch fixes the checkpatch.pl error to bpf_dbg.c:
ERROR: do not initialise statics to 0
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 2 Feb 2016 16:17:07 +0000 (08:17 -0800)]
net: Add support for filtering link dump by master device and kind
Add support for filtering link dumps by master device and kind, similar
to the filtering implemented for neighbor dumps.
Each net_device that exists adds between 1196 bytes (eth) and 1556 bytes
(bridge) to the link dump. As the number of interfaces increases so does
the amount of data pushed to user space for a link list. If the user
only wants to see a list of specific devices (e.g., interfaces enslaved
to a specific bridge or a list of VRFs) most of that data is thrown away.
Passing the filters to the kernel to have only relevant data returned
makes the dump more efficient.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 08:54:23 +0000 (03:54 -0500)]
Merge branch 'tcp-fast-so_reuseport'
Craig Gallek says:
====================
Faster SO_REUSEPORT for TCP
This patch series complements an earlier series (
6a5ef90c58da)
which added faster SO_REUSEPORT lookup for UDP sockets by
extending the feature to TCP sockets. It uses the same
array-based data structure which allows for socket selection
after finding the first listening socket that matches an incoming
packet. Prior to this feature, every socket in the reuseport
group needed to be found and examined before a selection could be
made.
With this series the SO_ATTACH_REUSEPORT_CBPF and
SO_ATTACH_REUSEPORT_EBPF socket options now work for TCP sockets
as well. The test at the end of the series includes an example of
how to use these options to select a reuseport socket based on the
cpu core id handling the incoming packet.
There are several refactoring patches that precede the feature
implementation. Only the last two patches in this series
should result in any behavioral changes.
v4
- Fix build issue when compiling IPv6 as a module. This required
moving the ipv6_rcv_saddr_equal into an object that is included as a
built-in object. I included this change in the second patch which
adds inet6_hash since that is where ipv6_rcv_saddr_equal will
later be called from non-module code.
v3:
- Another warning in the first patch caught by a build bot. Return 0 in
the no-op UDP hash function.
v2:
- In the first patched I missed a couple of hash functions that should now be
returning int instead of void. I missed these the first time through as it
only generated a warning and not an error :\
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:41 +0000 (11:50 -0500)]
soreuseport: BPF selection functional test for TCP
Unfortunately the existing test relied on packet payload in order to
map incoming packets to sockets. In order to get this to work with TCP,
TCP_FASTOPEN needed to be used.
Since the fast open path is slightly different than the standard TCP path,
I created a second test which sends to reuseport group members based
on receiving cpu core id. This will probably serve as a better
real-world example use as well.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:40 +0000 (11:50 -0500)]
soreuseport: fast reuseport TCP socket selection
This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP. Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access. This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet. Previously, every socket in the group
needed to be considered before handing off the incoming packet.
This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:39 +0000 (11:50 -0500)]
soreuseport: Prep for fast reuseport TCP socket selection
Both of the lines in this patch probably should have been included
in the initial implementation of this code for generic socket
support, but weren't technically necessary since only UDP sockets
were supported.
First, the sk_reuseport_cb points to a structure which assumes
each socket in the group has this pointer assigned at the same
time it's added to the array in the structure. The sk_clone_lock
function breaks this assumption. Since a child socket shouldn't
implicitly be in a reuseport group, the simple fix is to clear
the field in the clone.
Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
SO_REUSEPORT also be set first. For UDP sockets, this is easily
enforced at bind-time since that process both puts the socket in
the appropriate receive hlist and updates the reuseport structures.
Since these operations can happen at two different times for TCP
sockets (bind and listen) it must be explicitly checked to enforce
the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the
setsockopt call.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:38 +0000 (11:50 -0500)]
inet: refactor inet[6]_lookup functions to take skb
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups. Doing so with a BPF filter will require access to the
skb in question. This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:37 +0000 (11:50 -0500)]
tcp: __tcp_hdrlen() helper
tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr.
This splits the size calculation into a helper function that can be
used if a struct tcphdr is already available.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:36 +0000 (11:50 -0500)]
inet: create IPv6-equivalent inet_hash function
In order to support fast lookups for TCP sockets with SO_REUSEPORT,
the function that adds sockets to the listening hash set needs
to be able to check receive address equality. Since this equality
check is different for IPv4 and IPv6, we will need two different
socket hashing functions.
This patch adds inet6_hash identical to the existing inet_hash function
and updates the appropriate references. A following patch will
differentiate the two by passing different comparison functions to
__inet_hash.
Additionally, in order to use the IPv6 address equality function from
inet6_hashtables (which is compiled as a built-in object when IPv6 is
enabled) it also needs to be in a built-in object file as well. This
moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:35 +0000 (11:50 -0500)]
sock: struct proto hash function may error
In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 08:49:55 +0000 (03:49 -0500)]
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Here you have a batch of patches by Sven Eckelmann that
drops our private reference counting implementation and
substitutes it with the kref objects/functions.
Then you have a patch, by Simon Wunderlich, that
makes the broadcast protection window code more generic so
that it can be re-used in the future by other components
with different requirements.
Lastly, Sven is also introducing two lockdep asserts in
functions operating on our TVLV container list, to make
sure that the proper lock is always acquired by the users.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 08:47:05 +0000 (03:47 -0500)]
Merge branch 'be2net-next'
Ajit Khaparde says:
====================
be2net Patch series
Please consider applying these two patches to net-next
Patch-1: Request RSS capability of Rx interface depending on number of
Rx rings
Patch-2: Interpret and log new data that's added to the port
misconfigure async event
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ajit Khaparde [Wed, 10 Feb 2016 17:15:54 +0000 (22:45 +0530)]
be2net: Interpret and log new data that's added to the port misconfigure async event
>From FW version 11.0. onwards, the PORT_MISCONFIG event generated by the FW
will carry more information about the event in the "data_word1"
and "data_word2" fields. This patch adds support in the driver to parse the
new information and log it accordingly. This patch also changes some of the
messages that are being logged currently.
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ajit Khaparde [Wed, 10 Feb 2016 17:15:53 +0000 (22:45 +0530)]
be2net: Request RSS capability of Rx interface depending on number of Rx rings
Currently we request RSS capability even if a single Rx ring is created.
As a result in few cases we unnecessarily consume an RSS capable interface
which is a limited resource in the chip.
This patch enables RSS on an interface only if more than one Rx ring
is created.
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:57 +0000 (10:29 +0100)]
batman-adv: Convert batadv_tt_common_entry to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:56 +0000 (10:29 +0100)]
batman-adv: Convert batadv_orig_node to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:55 +0000 (10:29 +0100)]
batman-adv: Convert batadv_orig_node_vlan to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:54 +0000 (10:29 +0100)]
batman-adv: Convert batadv_hard_iface to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:53 +0000 (10:29 +0100)]
batman-adv: Convert batadv_neigh_node to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:52 +0000 (10:29 +0100)]
batman-adv: Convert batadv_orig_ifinfo to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:51 +0000 (10:29 +0100)]
batman-adv: Convert batadv_neigh_ifinfo to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Sven Eckelmann [Sat, 16 Jan 2016 09:29:50 +0000 (10:29 +0100)]
batman-adv: Convert batadv_tt_orig_list_entry to kref
batman-adv uses a self-written reference implementation which is just based
on atomic_t. This is less obvious when reading the code than kref and
therefore increases the change that the reference counting will be missed.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>