Florian Fainelli [Wed, 18 Dec 2013 05:38:05 +0000 (21:38 -0800)]
net: phy: cicada: fix checkpath errors
checkpath spotted a few stylistic errors fix them.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Dec 2013 05:36:51 +0000 (00:36 -0500)]
Merge branch 'vlan_tpid'
Atzm Watanabe says:
====================
packet: deliver VLAN TPID to userspace
This patchset enables userspace to get VLAN TPID as well as the VLAN TCI.
After the 802.1AD support, userspace packet receivers (packet dumper,
software switch, and the like) need how to know VLAN TPID in order to
reconstruct original tagged frame.
v4: Simply use sizeof(tp_padding) for zeroing the padding bytes,
commented by David Laight.
Use __u16 for tp_vlan_tpid in tpacket_hdr_variant1,
commented by Daniel Borkmann.
v3: Add a definition which indicates whether tp_vlan_tpid is valid.
Explicitly define pad bytes for tpacket{2,3}_hdr and pick the area
for tp_vlan_tpid from the definition. Commented by David Laight.
v2: Add BUILD_BUG_ON() to make current aligned size of
struct tpacket{2,3}_hdr clear. Commented by Ben Hutchings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Atzm Watanabe [Tue, 17 Dec 2013 13:53:40 +0000 (22:53 +0900)]
packet: deliver VLAN TPID to userspace
This enables userspace to get VLAN TPID as well as the VLAN TCI.
Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Atzm Watanabe [Tue, 17 Dec 2013 13:53:36 +0000 (22:53 +0900)]
packet: fill the gap of TPACKET_ALIGNMENT with zeros
struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
Explicitly defining and zeroing the gap of this makes additional changes
easier.
Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Atzm Watanabe [Tue, 17 Dec 2013 13:53:32 +0000 (22:53 +0900)]
packet: make aligned size of struct tpacket{2,3}_hdr clear
struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
We may add members to them until current aligned size without forcing
userspace to call getsockopt(..., PACKET_HDRLEN, ...).
Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Dec 2013 05:31:01 +0000 (00:31 -0500)]
Merge branch 'bna'
Rasesh Mody says:
====================
bna: Update the Driver to v3.2.23.0
This patch set consists of feature additions like s/w timestamping support,
multi-buffer RX, firmware patch simplification, enhancements for RX filters,
RX processing changes, bug fixes and updates the firmware version to v3.2.3.0.
The patch set addressed the review commnets recieved.
This patch set updates the BNA driver to v3.2.23.0 and was tested against
net-next 3.12.0-rc6 kernel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:42 +0000 (17:07 -0800)]
bna: Update the Driver Version to 3.2.23.0
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:41 +0000 (17:07 -0800)]
bna: Firmware Patch Simplification
This patch includes change to enable firmware patch simplication feature.
This feature is targeted to address the requirement to have independent patch
release for firmware. Prior to the 3.2.3.0 firmware, releasing a patch fix for
firmware would require changes to bna driver, to use new firmware images.
However with these changes, if the new firmware is flashed on to the Adapter,
the driver will use the new firmware after checking the patch release byte in
the firmware version.
Update the f/w version to 3.2.3.0
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:40 +0000 (17:07 -0800)]
bna: Embed SKB Length in TX Vector
- Store the length of the skb buffer mapped along with the handle and use it
while unmapping the buffer.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:39 +0000 (17:07 -0800)]
bna: Handle the TX Setup Failures
Change details:
- When bnad_setup_tx() returns NULL, the error is NOT returned to the caller.
The caller will incorrectly assume success. So Return ENOMEM when bna_tx_create()
fails.
- If bnad_tx_msix_register() fails, call bna_tx_destroy() to free tx & to NULL
the bnad reference to tcb.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:38 +0000 (17:07 -0800)]
bna: Add NULL Check Before Dereferencing TCB
Currently we already check to see whether the BNAD_TXQ_TX_STARTED cleared.
But if the tcb structure which contains this flag is also already freed by that
time, we would dereference the NULL pointer. This patch is to check tcb for NULL
pointer, before dereferencing it.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:37 +0000 (17:07 -0800)]
bna: CQ Read Fix
Valid bit check for completion needs read fence, so that it does not get
reordered with other loads.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:36 +0000 (17:07 -0800)]
bna: RX Processing and Config Changes
Change Details:
- Prefetch header in GRO path. This reduces napi_frags_skb time from 9% to 5%.
- Changed the configurable limit of RxQ depth to 16384 (was 2048).
- bnad_rx_unmap_q elements are cachealigned.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:35 +0000 (17:07 -0800)]
bna: Enable Multi Buffer RX
The CT2 HW supports multi-buffer Rx. This patch provides the necessary changes
for bnad to use multi-buffer Rx feature. For BNAD, multi-buffer Rx is by
default enabled when MTU is > 4096. For >4096 MTU, q0 data/large buffers are of
2048 size. As the resource requirements of multi-buffer Rx are different new Rx
needs to be created to use this feature. ASIC posts multiple completions if
frame exceeds buffer size. The last completion is marked with EOP flag.
- Separate HQ and DQ enums for resource allocations and configurations.
- rx_config and rxq structure changes to pass the correct info from bnad.
- DQ depth need not be same as HQ depth. So CQ depth is adjusted accordingly.
- Rx CFG frame size is taken from configured MTU.
- Rx q0 buffer size is configured from bnad s rx_config when multi-buffer is
enabled.
- Poll for entire frame completion.
- Once EOP completion is received gather the number of vectors used by the
frame to submit it to the stack.
- Changed MTU to frame size wherever necessary.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:34 +0000 (17:07 -0800)]
bna: RX Filter Enhancements
Change Details:
- Added bna_rx_ucast_listset() for synchronous ucast listadd operation.
- Clear mac->handle before adding it to free_q.
- bnad_set_rx_mode() rewritten. bnad_set_rx_mode() adds the MACs in uc_list
to UCAM. If it exceeds the max supported, DEFAULT mode is turned on. If
MCAM limit is exceeded, ALLMULTI mode is turned on.
- Clear CF flags, check for the new mode and reprogram the Rx approach.
- Added bnad_set_rx_ucast_fltr() and bnad_set_rx_mcast_fltr().
- Check for IFF_PROMISC to set the correct mode.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:33 +0000 (17:07 -0800)]
bna: Fix Filter Add Del
Change Details:
- bna_rx_mcast_listset() API first looks at free_q only and not at other
pending Qs rendering it non-deterministic of giving an upper limit.
Modify bna_rx_mcast_listset() implementation to not use only half of the
limit.
- Allocate and initialize queue for deleting
- Segregate the adding and deleting process by using separate queues.
- The filter framework in bna does not let adding addresses to its max capacity
due to asynchronous operations involved.
Provide a synchronous option to set a given list.
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:32 +0000 (17:07 -0800)]
bna: Set Get IOC fw State
Add APIs to set and get IOC currnet fw state and alt IOC fw state
- bfa_ioc_ct_set_cur_ioc_fwstate()
- bfa_ioc_ct_get_cur_ioc_fwstate()
- bfa_ioc_ct_set_alt_ioc_fwstate()
- bfa_ioc_ct_get_alt_ioc_fwstate()
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rasesh Mody [Wed, 18 Dec 2013 01:07:31 +0000 (17:07 -0800)]
bna: Add software timestamping support
- Invoke skb_tx_timestamp() API just before invoking txq_doorbell()
- Add ethtool (-T) support
Signed-off-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Dec 2013 02:26:19 +0000 (21:26 -0500)]
lib: Add missing arch generic-y entries for asm-generic/hash.h
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eugene Crosser [Mon, 16 Dec 2013 08:44:52 +0000 (09:44 +0100)]
qeth: Accurate ethtool output
For OSA devices that support the QUERY_CARD_INFO command, supply
accurate data based on the card type, port mode and link speed
via the 'ethtool' interface.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 16 Dec 2013 08:44:51 +0000 (09:44 +0100)]
netiucv: improve state checking in conn_action_txdone
state checking in conn_action_txdone() is inconsistent.
This patch makes it consistent and issues a trace message
if an unexpected state is detected for the netiucv device.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 16 Dec 2013 10:45:01 +0000 (11:45 +0100)]
bpf_exp: free duplicated labels at exit time
Valgrind found that extracted labels that are passed from the lexer
weren't freed upon exit. Therefore, add a small helper function that
walks label tables and frees them. Since also NULL can be passed to
free(3), we do not need to take care of that here. While at it, fix
up a spacing error in bpf_set_curr_label().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 16 Dec 2013 10:45:00 +0000 (11:45 +0100)]
bpf_dbg: always close socket in bpf_runnable
We must not leave the socket intact in bpf_runnable(). The socket
is used to test if the filter code is being accepted by the kernel
or not. So right after we do the setsockopt(2), we need to close
it again.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 22:09:49 +0000 (17:09 -0500)]
Merge branch 'qlcnic'
Manish Chopra says:
====================
qlcnic: Refactoring and enhancements
This patch series includes follwing changes
* Refactor DCBX code. Do not allow DCBX operations for VFs
* Issue INIT_NIC mailbox command only once
* Refactor initialize nic code path
* Allow configuration for single TX/RX queue
* VLAN enhancement for 84xx adapters
* Support for 16 virtual NIC functions for 84XX series adapters
Please apply to net-next
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Tue, 17 Dec 2013 14:01:55 +0000 (09:01 -0500)]
qlcnic: update version to 5.3.53
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jitendra Kalsaria [Tue, 17 Dec 2013 14:01:54 +0000 (09:01 -0500)]
qlcnic: Support for 16 virtual NIC functions.
Extend virtual NIC functions from 8 to 16 for 84xx adapter.
Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Tue, 17 Dec 2013 14:01:53 +0000 (09:01 -0500)]
qlcnic: VLAN enhancement for 84XX adapters
o Support multiple VLANs on 84xx VF devices
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Himanshu Madhani [Tue, 17 Dec 2013 14:01:52 +0000 (09:01 -0500)]
qlcnic: Allow single Tx/Rx queue for all adapters.
o Allow user to set sigle Tx/Rx queue in MSI-x mode,
for ALL supported adapters.
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sucheta Chakraborty [Tue, 17 Dec 2013 14:01:51 +0000 (09:01 -0500)]
qlcnic: Refactor initialize nic code path.
o Change function name from qlcnic_83xx_register_nic_idc_func to
qlcnic_83xx_initialize_nic
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sucheta Chakraborty [Tue, 17 Dec 2013 14:01:50 +0000 (09:01 -0500)]
qlcnic: Issue INIT_NIC command only once.
o DCB AEN registration was reissuing INIT_NIC command. Instead, club
all options of INIT NIC command and issue this command only once.
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sucheta Chakraborty [Tue, 17 Dec 2013 14:01:49 +0000 (09:01 -0500)]
qlcnic: Disable DCB operations from SR-IOV VFs.
o These operations will be supported only through PFs (SR-IOV and non-SR-IOV).
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 22:08:21 +0000 (17:08 -0500)]
Merge branch 'for-davem' of git://git./linux/kernel/git/bwh/sfc-next
Ben Hutchings says:
====================
Miscellaneous changes for 3.14:
1. Add more information to some WARN messages.
2. Refactor pushing of RSS configuration, from Andrew Rybchenko.
3. Refactor handling of automatic (device address list) vs manual (RX
NFC) MAC filters.
4. Implement clearing of manual RX filters on EF10 when ntuple offload
is disabled.
5. Remove definitions that are unused since the RX buffer allocation
changes, from Andrew Rybchenko.
6. Improve naming of some statistics, from Shradha Shah.
7. Add statistics for PTP support code.
8. Fix insertion of RX drop filters on EF10.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 21:36:33 +0000 (16:36 -0500)]
Merge branch 'skb_hash'
Tom Herbert says:
====================
net: Add rxhash utility hash functions
v3:
There's really nothing specific about rxhash that constrains
it to be a value just for these receive path. Drop the 'rx'
part in utility functions, including skb_get_rxhash. In subsequent
patches, we can change the rxhash and l4_rxhash names also, as
well as abstracting out the interface to the hash.
Added comments about hash types per feedback.
In this version I'm omitting the changes to drivers to make the
patch set manageable. Will add those changes in followup pathes.
-----
This patch series introduce skb_set_rxhash and skb_clear_rxhash
which are called to set the rxhash (from network drivers) and
to clear the rxhash. This API should be used instead of updating
fields in the skbuff directly.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 16 Dec 2013 06:16:29 +0000 (22:16 -0800)]
net: Add utility function to copy skb hash
Adds skb_copy_hash to copy rxhash and l4_rxhash from one skb to another.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 16 Dec 2013 06:16:19 +0000 (22:16 -0800)]
net: Add function to set the rxhash
The function skb_set_rxash was added for drivers to call to set
the rxhash in an skb. The type of hash is also specified as
a parameter (L2, L3, L4, or unknown type).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 16 Dec 2013 06:12:18 +0000 (22:12 -0800)]
net: Add utility functions to clear rxhash
In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb. This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash. This patch adds an explict function to clear
all the rxhash related information in the skb properly.
skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.
Fixed up places where 'skb->rxhash = 0' was being called.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 16 Dec 2013 06:12:06 +0000 (22:12 -0800)]
net: Change skb_get_rxhash to skb_get_hash
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Mon, 16 Dec 2013 06:05:50 +0000 (14:05 +0800)]
net/hsr: using kfree_rcu() to simplify the code
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 21:09:15 +0000 (16:09 -0500)]
Merge branch 'bonding_netlink'
Scott Feldman says:
====================
bonding: add some more netlink attributes
The following series implements five more bonding netlink attributes:
primary
primary_reselect
fail_over_mac
xmit_hash_policy
resend_igmp
Tested with modified iproute2 to verify attributes can be set at bond creation
time or set later. Verified sysfs interface to attributes continues to work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Mon, 16 Dec 2013 00:42:19 +0000 (16:42 -0800)]
bonding: add resend_igmp attribute netlink support
Add IFLA_BOND_RESEND_IGMP to allow get/set of bonding parameter
resend_igmp via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Mon, 16 Dec 2013 00:42:12 +0000 (16:42 -0800)]
bonding: add xmit_hash_policy attribute netlink support
Add IFLA_BOND_XMIT_HASH_POLICY to allow get/set of bonding parameter
xmit_hash_policy via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Mon, 16 Dec 2013 00:42:05 +0000 (16:42 -0800)]
bonding: add fail_over_mac attribute netlink support
Add IFLA_BOND_FAIL_OVER_MAC to allow get/set of bonding parameter
fail_over_mac via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Mon, 16 Dec 2013 00:41:58 +0000 (16:41 -0800)]
bonding: add primary_select attribute netlink support
Add IFLA_BOND_PRIMARY_SELECT to allow get/set of bonding parameter
primary_select via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Mon, 16 Dec 2013 00:41:51 +0000 (16:41 -0800)]
bonding: add primary attribute netlink support
Add IFLA_BOND_PRIMARY to allow get/set of bonding parameter
primary via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 15 Dec 2013 21:15:25 +0000 (13:15 -0800)]
pkt_sched: fq: more robust memory allocation
This patch brings NUMA support and automatic fallback to vmalloc()
in case kmalloc() failed to allocate FQ hash table.
NUMA support depends on XPS being setup for the device before
qdisc allocation. After a XPS change, it might be worth creating
qdisc hierarchy again.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 14 Dec 2013 11:32:10 +0000 (12:32 +0100)]
bondnl: use be32 nla put/get for be32 values
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Sat, 14 Dec 2013 00:40:10 +0000 (16:40 -0800)]
bnad: make local variable static
Compile tested only.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Rasesh Mody <rmody@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 Dec 2013 21:51:23 +0000 (13:51 -0800)]
tcp: refine TSO splits
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.
Example with request/response of 4K/4K
IP A > B: . ack 68432 win 2783 <nop,nop,timestamp
6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp
6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp
6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp
6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp
6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp
6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp
6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp
6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp
6524593 6525001>
We can avoid this split by including the Nagle tests at the right place.
Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.
tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.
This patch tremendously improves performance, as the traffic now looks
like :
IP A > B: . ack 98304 win 2783 <nop,nop,timestamp
6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp
6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp
6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp
6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp
6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp
6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp
6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp
6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp
6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp
6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp
6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp
6834688 6834280>
Before :
lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774
Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
205719.049006 task-clock # 9.278 CPUs utilized
8,449,968 context-switches # 0.041 M/sec
1,935,997 CPU-migrations # 0.009 M/sec
160,541 page-faults # 0.780 K/sec
548,478,722,290 cycles # 2.666 GHz [83.20%]
455,240,670,857 stalled-cycles-frontend # 83.00% frontend cycles idle [83.48%]
272,881,454,275 stalled-cycles-backend # 49.75% backend cycles idle [66.73%]
166,091,460,030 instructions # 0.30 insns per cycle
# 2.74 stalled cycles per insn [83.39%]
29,150,229,399 branches # 141.699 M/sec [83.30%]
1,943,814,026 branch-misses # 6.67% of all branches [83.32%]
22.
173517844 seconds time elapsed
lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests
16851063 0.0
IpExtOutOctets
23878580777 0.0
After patch :
lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877
Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
107496.071918 task-clock # 4.847 CPUs utilized
5,635,458 context-switches # 0.052 M/sec
1,374,707 CPU-migrations # 0.013 M/sec
160,920 page-faults # 0.001 M/sec
281,500,010,924 cycles # 2.619 GHz [83.28%]
228,865,069,307 stalled-cycles-frontend # 81.30% frontend cycles idle [83.38%]
142,462,742,658 stalled-cycles-backend # 50.61% backend cycles idle [66.81%]
95,227,712,566 instructions # 0.34 insns per cycle
# 2.40 stalled cycles per insn [83.43%]
16,209,868,171 branches # 150.795 M/sec [83.20%]
874,252,952 branch-misses # 5.39% of all branches [83.37%]
22.
175821286 seconds time elapsed
lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests
11239428 0.0
IpExtOutOctets
23595191035 0.0
Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.
Many thanks to Neal for review and ideas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Fri, 13 Dec 2013 20:35:56 +0000 (12:35 -0800)]
net: remove dead code for add/del multiple
These function to manipulate multiple addresses are not used anywhere
in current net-next tree. Some out of tree code maybe using these but
too bad; they should submit their code upstream..
Also, make __hw_addr_flush local since only used by dev_addr_lists.c
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 20:08:17 +0000 (15:08 -0500)]
Merge branch 'for-davem' of git://git./linux/kernel/git/linville/wireless-next
John W. Linville says:
====================
Please pull this batch of updates for the 3.14 stream...
For the Bluetooth bits, Gustavo says:
"This is the first batch of patches intended for 3.14. There is
nothing big here. Most of the code are refactors, clean up, small
fixes, plus some new device id support."
And...
"More patches to 3.14. Here we have the support for Low Energy
Connection Oriented Channels (LE CoC). Basically, as the name says,
this adds supports for connection oriented channels in the same way
we already have them for BR/EDR connections so profiles/protocols
that work on top of BR/EDR can now work on LE plus a plenty of new
possibilities for LE."
For the ath10k bits, Kalle says:
"Janusz and Marek implemented DFS support to ath10k, but the code is
not enabled yet due to missing cfg80211/mac80211 patches (it will be
enabled in the next pull request). Michal did some device reset fixes
and made it possible for ath10k to share an interrupt with another
device. And lots of smaller fixes from different people."
For the iwlwifi bits, Emmanuel says:
"I have here a big rework of the rate control by Eyal. This is obviously
the biggest part of this batch.
I also have enhancement of protection flags by Avri and a few bits for
WoWLAN by Eliad and Luca. Johannes cleans up the debugfs plus a few
fixes. I provided a few things for Bluetooth coexistence.
Besides this we have an implementation for low priority scan."
Along with all that, there are big batches of updates to mwifiex and
ath9k, Jeff Kirsher's FSF address fix patches, and a handful of other
bits here and there.
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 19:42:57 +0000 (14:42 -0500)]
Merge branch 'phy_power'
Sebastian Hesselbarth says:
====================
net: phy: Ethernet PHY powerdown optimization
This is v2 of the ethernet PHY power optimization patches to reduce
power consumption of network PHYs with link that are either unused or
the corresponding netdev is down.
Compared to the last version, this patch set drops a patch to disable
unused PHYs after late initcall, as it is not compatible with a modular
mdio bus [1]. I'll investigate different ways to have a modular mdio bus
driver get notified when driver loading is done.
Again, a branch with v2 applied to v3.13-rc2 can also be found at
https://github.com/shesselba/linux-dove.git topic/ethphy-power-v2
[1] http://www.spinics.net/lists/arm-kernel/msg293028.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Hesselbarth [Fri, 13 Dec 2013 09:20:29 +0000 (10:20 +0100)]
net: phy: suspend phydev when going to HALTED
When phydev is going to HALTED state, we can try to suspend it to
safe more power. phy_suspend helper will check if PHY can be suspended,
so just call it when entering HALTED state.
Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Hesselbarth [Fri, 13 Dec 2013 09:20:28 +0000 (10:20 +0100)]
net: phy: resume/suspend PHYs on attach/detach
This ensures PHYs are resumed on attach and suspended on detach.
Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Hesselbarth [Fri, 13 Dec 2013 09:20:27 +0000 (10:20 +0100)]
net: phy: provide phy_resume/phy_suspend helpers
This adds helper functions to resume and suspend a given phy_device
by calling the corresponding driver callbacks if available.
Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Hesselbarth [Fri, 13 Dec 2013 09:20:26 +0000 (10:20 +0100)]
net: phy: marvell: provide genphy suspend/resume
Marvell PHYs support generic PHY suspend/resume, so provide those
callbacks to all marvell specific drivers.
Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Hesselbarth [Fri, 13 Dec 2013 09:20:25 +0000 (10:20 +0100)]
net: mv643xx_eth: properly start/stop phy device
When using phydev, it should be phy_start/phy_stop'ed properly. This
driver doesn't do that, so add the corresponding calls to port_start/
stop respectively.
Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 13 Dec 2013 05:51:04 +0000 (13:51 +0800)]
sctp: Reorder 'struc association' members to reduce its size
Members of 'struct association' are not in appropriate order to
reuse compiler added padding on 64bit architectures. In this patch
we reorder those struct members and help reduce the size of the
structure from 2776 bytes to 2720 bytes on 64 bit architectures.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 19:31:17 +0000 (14:31 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e only (again).
Jesse provides a fix for when tx_rings structure is NULL and we do not want
to panic. Then refactors the flow control set up and disables L2 flow control
by default. Provides some trivial fixes as well as prevent compiler warnings.
Then to align to similar behaviour in ixgbe, use the total number of CPUs in
the system to suggest the number of transmit and receive queue pairs.
Shannon provides a i40e ethtool fix to get some more reasonable information
reports back out to the ethtool. In addition, fixes PF reset after offline
test, where it reorders the test to put the register test last as it is the
only one that needs a reset, and we wait to trigger the reset until after we
clear the testing bit. Lastly provides basic support for handling suspend
and resume for now, later on Wake-On-LAN support will be added.
Anjali provides changes to tell the stack about our actual number of queues
in order for RFS/RPS/XFS to work correctly. Then provides several patches to
implement dynamically changing the queue count for the main VSI. Adds
basic support for get/set channels for RSS so that the number of receive and
transmit queue pair can be changed via ethtool. Cleans up the use of
rtnl_lock in the reset patch since it runs from a work time.
Neerav Parikh cleans up the VF interface to remove FCoE code as this
feature will not be supported on VF interfaces.
v2:
- submitted patch 1 to net (since it was a fix needed for net), so dropped
from this series (this patch will get added to net-next when Dave syncs
his trees)
- Dropped patches 4 & 11 from previous submission because of feedback
received from Ben Hutchings and Sergei Shtylyov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Dec 2013 19:27:44 +0000 (14:27 -0500)]
Merge branch 'ovs_hash'
Francesco Fusco says:
====================
ovs: introduce arch-specific fast hashing improvements
From: Daniel Borkmann <dborkman@redhat.com>
We are introducing a fast hash function (see patch1) that can be
used in the context of OpenVSwitch to reduce the hashing footprint
(patch2). For details, please see individual patches!
v1->v2:
- Make hash generic and place it under lib
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Francesco Fusco [Thu, 12 Dec 2013 15:09:06 +0000 (16:09 +0100)]
net: ovs: use CRC32 accelerated flow hash if available
Currently OVS uses jhash2() for calculating flow hashes in its
internal flow_hash() function. The performance of the flow_hash()
function is critical, as the input data can be hundreds of bytes
long.
OVS is largely deployed in x86_64 based datacenters. Therefore,
we argue that the performance critical fast path of OVS should
exploit underlying CPU features in order to reduce the per packet
processing costs. We replace jhash2 with the hash implementation
provided by the kernel hash lib, which exploits the crc32l
instruction to achieve high performance
Our patch greatly reduces the hash footprint from ~200 cycles of
jhash2() to around ~90 cycles in case of ovs_flow_hash_crc()
(measured with rdtsc over maximum length flow keys on an i7 Intel
CPU).
Additionally, we wrote a microbenchmark to stress the flow table
performance. The benchmark inserts random flows into the flow
hash and then performs lookups. Our hash deployed on a CRC32
capable CPU reduces the lookup for 1000 flows, 100 masks from
~10,100us to ~6,700us, for example.
Thus, simply use the newly introduced arch_fast_hash2() as a
drop-in replacement.
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Francesco Fusco [Thu, 12 Dec 2013 15:09:05 +0000 (16:09 +0100)]
lib: introduce arch optimized hash library
We introduce a new hashing library that is meant to be used in
the contexts where speed is more important than uniformity of the
hashed values. The hash library leverages architecture specific
implementation to achieve high performance and fall backs to
jhash() for the generic case.
On Intel-based x86 architectures, the library can exploit the crc32l
instruction, part of the Intel SSE4.2 instruction set, if the
instruction is supported by the processor. This implementation
is twice as fast as the jhash() implementation on an i7 processor.
Additional architectures, such as Arm64 provide instructions for
accelerating the computation of CRC, so they could be added as well
in follow-up work.
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
tanxiaojun [Fri, 13 Dec 2013 06:49:56 +0000 (14:49 +0800)]
fddi: cleanup unsigned to unsigned int/short
Use "unsigned int/short" instead of "unsigned", and change the type of
iteration variable "i" to "unsigned int".
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Mon, 16 Dec 2013 18:56:24 +0000 (18:56 +0000)]
sfc: Fix RX drop filters for EF10
When we insert an filter, the firmware checks that the given RX queue
index is in range even if it will not be used. In case we're
inserting a drop filter, pass the value 0.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
wangweidong [Thu, 12 Dec 2013 01:36:42 +0000 (09:36 +0800)]
tipc: change lock_sock order in connect()
Instead of reaquiring the socket lock and taking the normal exit
path when a connection times out, we bail out early with a
return -ETIMEDOUT.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Thu, 12 Dec 2013 01:36:41 +0000 (09:36 +0800)]
tipc: Use <linux/uaccess.h> instead of <asm/uaccess.h>
As warned by checkpatch.pl, use #include <linux/uaccess.h>
instead of <asm/uaccess.h>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Thu, 12 Dec 2013 01:36:40 +0000 (09:36 +0800)]
tipc: kill unnecessary goto's
Remove a number of needless 'goto exit' in send_stream
when the socket is in an unconnected state.
This patch is cosmetic and does not alter the operation of
TIPC in any way.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Thu, 12 Dec 2013 01:36:39 +0000 (09:36 +0800)]
tipc: remove unnecessary variables and conditions
We remove a number of unnecessary variables and branches
in TIPC. This patch is cosmetic and does not change the
operation of TIPC in any way.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neerav Parikh [Tue, 26 Nov 2013 10:49:24 +0000 (10:49 +0000)]
i40e: Remove FCoE in i40e_virtchnl_pf.c code
Remove FCoE code from the VF interface, as the feature will
not be supported on VF interfaces.
Change-Id: Ie9db04fa2e37fa14ac3e73a9c20980348d931357
Signed-off-by: Neerav Parikh <Neerav.Parikh@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Tue, 26 Nov 2013 10:49:23 +0000 (10:49 +0000)]
i40e: support for suspend and resume
Add basic support for handling suspend and resume. We'll add
Wake-on-LAN support later.
Change-Id: Iea5e11c81bd9289a5bdbf086de8f626911a0b5ce
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Tue, 26 Nov 2013 10:49:22 +0000 (10:49 +0000)]
i40e: rtnl_lock in reset path fixes
Any user-initiated path which eventually calls reset needs
to hold the rtnl_lock, so add functionality to do that.
Be careful not to use the safe reset when cleaning up
from the diagnostic tests, which avoids rtnl_lock
recursion from ethtool.
Protect the reset_task with rtnl_lock, since it runs from a work item.
Change-Id: Ib6e7a3fb2966809db2daf35fd5a123ccdf6f6f0f
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Tue, 26 Nov 2013 11:59:30 +0000 (11:59 +0000)]
i40e: Add basic support for get/set channels for RSS
Implement the number of receive/transmit queue pair being
changed on the fly by ethtool.
Change-Id: I70df2363f1ca40b63610baa561c5b6b92b81bca7
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Tue, 26 Nov 2013 10:49:19 +0000 (10:49 +0000)]
i40e: function to reconfigure RSS queues and rebuild
This is the second of 3 patches that allows for changing
the number of queues in the driver on the fly.
This patch adds a function that calls the reinit flow for the
main VSI after making changes to the RSS queue count as requested
by the user.
Change-Id: I82dee91e9fe90eeb4e84a7369f4b8b342155dd85
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Tue, 26 Nov 2013 10:49:18 +0000 (10:49 +0000)]
i40e: reinit flow for the main VSI
This patch is the first in a 3 series patchset to implement
dynamically changing the queue count for the main VSI.
This patch starts by adding a reinit flow. This flow is designed
to be able to change just the queue count and not the number of
interrupt vectors that the device originally came up with.
Change-Id: I0634aaebf7dc4dd6c66af8f9dbbef89d7beac438
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Tue, 26 Nov 2013 10:49:17 +0000 (10:49 +0000)]
i40e: use same number of queues as CPUs
The current driver default sets the number of transmit/receive
queue pairs based on the current node's CPU count.
A better method is to use the total number of CPUs in the system
to suggest the number of queue pairs, which aligns better with
the behavior of ixgbe, and also with the expectations of the
kernel XPS and other subsystems in the stack.
Change-Id: If3e20c7f100f13e51d69762594d948f247ffe0c8
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Tue, 26 Nov 2013 10:49:16 +0000 (10:49 +0000)]
i40e: trivial fixes
Prevent some compiler warnings and implement some other
trivial fixes.
Change-Id: I7f49d79b91b94df1ad4a8306a0410ed72238845f
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Tue, 26 Nov 2013 10:49:15 +0000 (10:49 +0000)]
i40e: init flow control settings to disabled
Refactor flow control set up and disable L2 flow
control by default.
Change-Id: I2fe257b80df6d9a1e37deb4df118da8f8467040d
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Tue, 26 Nov 2013 10:49:14 +0000 (10:49 +0000)]
i40e: Tell the stack about our actual number of queues
Call the netif_set_real* functions in order to make sure
the stack knows about how many queues we have, in order
for RFS/RPS/XFS to work correctly.
Change-Id: Ib7a7b2792f80c5eef210dedf42cc6607d63953d2
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Tue, 26 Nov 2013 10:49:12 +0000 (10:49 +0000)]
i40e: fix pf reset after offline test
When the ethtool testing starts it sets the I40E_TESTING state
bit, which blocks new netdev opens so that things don't get
confused, while the testing might be messing with register and
other things. Unfortunately, that was keeping the PF resets
after the register test from working correctly because the netdev
would not get reopened. This patch reorders the tests to put the
register test last as it is the only one that needs a reset, and
we wait to trigger the reset until after we clear the
I40E_TESTING bit.
Change-Id: Ieaa18d74264250ac336b0656b490125ee8a22d2a
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Tue, 26 Nov 2013 10:49:10 +0000 (10:49 +0000)]
i40e: fix up some of the ethtool connection reporting
Get some more reasonable information reported back out to ethtool
for the different types of connections supported.
Change-Id: I57b153f86b9cdd04ad7cb5bf7d1c45873c196a7a
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jerry Chu [Mon, 16 Dec 2013 02:48:07 +0000 (18:48 -0800)]
net-ipv6: Fix alleged compiler warning in ipv6_exthdrs_len()
It was reported that Commit
299603e8370a93dd5d8e8d800f0dff1ce2c53d36
("net-gro: Prepare GRO stack for the upcoming tunneling support")
triggered a compiler warning in ipv6_exthdrs_len():
net/ipv6/ip6_offload.c: In function ‘ipv6_gro_complete’:
net/ipv6/ip6_offload.c:178:24: warning: ‘optlen’ may be used uninitialized in this function [-Wmaybe-u
opth = (void *)opth + optlen;
^
net/ipv6/ip6_offload.c:164:22: note: ‘optlen’ was declared here
int len = 0, proto, optlen;
^
Note that there was no real bug here - optlen was never uninitialized
before use. (Was the version of gcc I used smarter to not complain?)
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 15 Dec 2013 03:33:45 +0000 (22:33 -0500)]
Merge branch 'for-davem' of git://git./linux/kernel/git/bwh/sfc-next
Ben Hutchings says:
====================
1. Change PTP clock name to 'sfc'.
2. Complete support for hardware timestamping and PTP clock on the
SFC9100 family.
3. Various cleanups for the PTP code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Sat, 14 Dec 2013 06:29:29 +0000 (07:29 +0100)]
ipv6: fix compiler warning in ipv6_exthdrs_len
Commit
299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO
stack for the upcoming tunneling support") used an uninitialized variable
which leads to the following compiler warning:
net/ipv6/ip6_offload.c: In function ‘ipv6_gro_complete’:
net/ipv6/ip6_offload.c:178:24: warning: ‘optlen’ may be used uninitialized in this function [-Wmaybe-uninitialized]
opth = (void *)opth + optlen;
^
net/ipv6/ip6_offload.c:164:22: note: ‘optlen’ was declared here
int len = 0, proto, optlen;
^
Fix it up.
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 14 Dec 2013 06:58:19 +0000 (01:58 -0500)]
Merge branch 'bonding_rcu'
Ding Tianhong says:
====================
bonding: rebuild the lock use for bond monitor
Now the bond slave list is not protected by bond lock, only by RTNL,
but the monitor still use the bond lock to protect the slave list,
it is useless, according to the Veaceslav's opinion, there were
three way to fix the protect problem:
1. add bond_master_upper_dev_link() and bond_upper_dev_unlink()
in bond->lock, but it is unsafe to call call_netdevice_notifiers()
in write lock.
2. remove unused bond->lock for monitor function, only use the exist
rtnl lock(), it will take performance loss in fast path.
3. use RCU to protect the slave list, of course, performance is better,
but in slow path, it is ignored.
obviously the solution 1 is not fit here, I will consider the 2 and 3
solution. My principle is simple, if in fast path, RCU is better,
otherwise in slow path, both is well, but according to the Jay Vosburgh's
opinion, the monitor will loss performace if use RTNL to protect the all
slave list, so remove the bond lock and replace with RCU.
The second problem is the curr_slave_lock for bond, it is too old and
unwanted in many place, because the curr_active_slave would only be
changed in 3 place:
1. enslave slave.
2. release slave.
3. change active slave.
all above were already holding bond lock, RTNL and curr_slave_lock
together, it is tedious and no need to add so mach lock, when change
the curr_active_slave, you have to hold the RTNL and curr_slave_lock
together, and when you read the curr_active_slave, RTNL or curr_slave_lock,
any one of them is no problem.
for the stability, I did not change the logic for the monitor,
all change is clear and simple, I have test the patch set for lockdep,
it work well and stability.
v2. accept the Jay Vosburgh's opinion, remove the RTNL and replace with RCU,
also add some rcu function for bond use, so the patch set reach 10.
v3. accept the Nikolay Aleksandrov's opinion, remove no needed bond_has_slave_rcu(),
add protection for several 3ad mode handler functions and current_arp_slave.
rebuild the bond_first_slave_rcu(), make it more clear.
v4. because the struct netdev_adjacent should not be exist in netdevice.h, so I have
to make a new function to support micro bond_first_slave_rcu().
also add a new patch to simplify the bond_resend_igmp_join_requests_delayed().
v5. according the Jay Vosburgh's opinion, in patch 2 and 6, the calling of notify
peer is hardly to happen with the bond_xxx_commit() when the monitoring is running,
so the performance impact about make two round trips to one trip on RTNL is minimal,
no need to do that,the reason is very clear, so modify the patch 2 and 6, recover
the notify peer in RTNL alone.
====================
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:20:26 +0000 (10:20 +0800)]
bonding: rebuild the bond_resend_igmp_join_requests_delayed()
The bond_resend_igmp_join_requests_delayed() and
bond_resend_igmp_join_requests() should be integrated,
because the bond_resend_igmp_join_requests_delayed() did
nothing except bond_resend_igmp_join_requests().
The bond igmp_retrans could only be changed in bond_change_active_slave
and here, bond_change_active_slave will be called in RTNL and curr_slave_lock,
the bond_resend_igmp_join_requests already hold RTNL, so no need
to free RTNL and hold curr_slave_lock again, it may be a small optimization,
so move the igmp_retrans in RTNL and remove the curr_slave_lock.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:20:22 +0000 (10:20 +0800)]
bonding: remove unwanted lock for bond_store_primaryxxx()
The bond_select_active_slave() will not release and acquire
bond lock, so it is no need to read the bond lock for them,
and the bond_store_primaryxxx() is already in RTNL, so remove the
unwanted lock.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:20:17 +0000 (10:20 +0800)]
bonding: remove unwanted lock for bond_option_active_slave_set()
The bond_option_active_slave_set() is always called in RTNL,
the RTNL could protect bond slave list, so remove the unwanted
bond lock.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:20:12 +0000 (10:20 +0800)]
bonding: add RCU for bond_3ad_state_machine_handler()
The bond_3ad_state_machine_handler() use the bond lock to protect
the bond slave list and slave port together, but it is not enough,
the bond slave list was link and unlink in RTNL, not bond lock,
so I add RCU to protect the slave list from leaving.
The bond lock is still used here, because when the slave has been
removed from the list by the time the state machine runs, it appears
to be possible for both function to manupulate the same aggregator->lag_ports
by finding the aggregator via two different ports that are both members of
that aggregator (i.e., port A of the agg is being unbound, and port B
of the agg is runing its state machine).
If I remove the bond lock, there are nothing to mutex changes
to aggregator->lag_ports between bond_3ad_state_machine_handler and
bond_3ad_unbind_slave, So the bond lock is the simplest way to protect
aggregator->lag_ports.
There was a lot of function need RCU protect, I have two choice
to make the function in RCU-safe, (1) create new similar functions
and make the bond slave list in RCU. (2) modify the existed functions
and make them in read-side critical section, because the RCU
read-side critical sections may be nested.
I choose (2) because it is no need to create more similar functions.
The nots in the function is still too old, clean up the nots.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:20:07 +0000 (10:20 +0800)]
bonding: remove unwanted lock for bond enslave and release
The bond_change_active_slave() and bond_select_active_slave()
do't need bond lock anymore, so remove the unwanted bond lock
for these two functions.
The bond_select_active_slave() will release and acquire
curr_slave_lock, so the curr_slave_lock need to protect
the function.
In bond enslave and bond release, the bond slave list is also
protected by RTNL, so bond lock is no need to exist, remove
the lock and clean the functions.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:20:02 +0000 (10:20 +0800)]
bonding: rebuild the lock use for bond_activebackup_arp_mon()
The bond_activebackup_arp_mon() use the bond lock for read to
protect the slave list, it is no effect, and the RTNL is only
called for bond_ab_arp_commit() and peer notify, for the performance
better, use RCU to replace with the bond lock, to the bond slave
list need to called in RCU, add a new bond_first_slave_rcu()
to get the first slave in RCU protection.
In bond_ab_arp_probe(), the bond->current_arp_slave may changd
if bond release slave, just like:
bond_ab_arp_probe() bond_release()
cpu 0 cpu 1
...
if (bond->current_arp_slave...) ...
... bond->current_arp_slave = NULl
bond->current_arp_slave->dev->name ...
So the current_arp_slave need to dereference in the section.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:19:55 +0000 (10:19 +0800)]
bonding: create bond_first_slave_rcu()
The bond_first_slave_rcu() will be used to instead of bond_first_slave()
in rcu_read_lock().
According to the Jay Vosburgh's suggestion, the struct netdev_adjacent
should hide from users who wanted to use it directly. so I package a
new function to get the first slave of the bond.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:19:50 +0000 (10:19 +0800)]
bonding: rebuild the lock use for bond_loadbalance_arp_mon()
The bond_loadbalance_arp_mon() use the bond lock to protect the
bond slave list, it is no effect, so I could use RTNL or RCU to
replace it, considering the performance impact, the RCU is more
better here, so the bond lock replace with the RCU.
The bond_select_active_slave() need RTNL and curr_slave_lock
together, but there is no RTNL lock here, so add a rtnl_rtylock.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:19:45 +0000 (10:19 +0800)]
bonding: rebuild the lock use for bond_alb_monitor()
The bond_alb_monitor use bond lock to protect the bond slave list,
it is no effect here, we need to use RTNL or RCU to replace bond lock,
the bond_alb_monitor will called 10 times one second, RTNL may loss
performance here, so I replace bond lock with RCU to protect the
bond slave list, also the RTNL is preserved, the logic of the monitor
did not changed.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:19:39 +0000 (10:19 +0800)]
bonding: rebuild the lock use for bond_mii_monitor()
The bond_mii_monitor() still use bond lock to protect bond slave list,
it is no effect, I have 2 way to fix the problem, move the RTNL to the
top of the function, or add RCU to protect the bond slave list,
according to the Jay Vosburgh's opinion, 10 times one second is a
truely big performance loss if use RTNL to protect the whole monitor,
so I would take the advice and use RCU to protect the bond slave list.
The bond_has_slave() will not protect by anything, there will no things
happen if the slave list is be changed, unless the bond was free, but
it will not happened before the monitor, the bond will closed before
be freed.
The peers notify for the bond will calling curr_active_slave, so
derefence the slave to make sure we will accessing the same slave
if the curr_active_slave changed, as the rcu dereference need in
read-side critical sector and bond_change_active_slave() will call
it with no RCU hold, so add peer notify in rcu_read_lock which
will be nested in monitor.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 13 Dec 2013 02:19:32 +0000 (10:19 +0800)]
bonding: remove the no effect lock for bond_select_active_slave()
The bond slave list was no longer protected by bond lock and only
protected by RTNL or RCU, so anywhere that use bond lock to protect
slave list is meaningless.
remove the release and acquire bond lock for bond_select_active_slave().
The curr_active_slave could only be changed in 3 place:
1. enslave slave.
2. release slave.
3. change_active_slave.
all above place were holding bond lock, RTNL and curr_slave_lock
together, it is tedious and meaningless, obviously bond lock is no
need here, but RTNL or curr_slave_lock is needed, so if you want
to access active slave, you have to choose one lock, RTNL or
curr_slave_lock, if RTNL is exist, no need to add curr_slave_lock,
otherwise curr_slave_lock is better, because of the performance.
there are several place calling bond_select_active_slave() and
bond_change_active_slave(), the next step I will clean these place
and remove the no effect lock.
there are some document changed together when update the function.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 12 Dec 2013 23:41:56 +0000 (15:41 -0800)]
pkt_sched: set root qdisc before change() in attach_default_qdiscs()
After commit
95dc19299f74 ("pkt_sched: give visibility to mq slave
qdiscs") we call disc_list_add() while the device qdisc might be
the noop_qdisc one.
This shows up as duplicates in "tc qdisc show", as all inactive devices
point to noop_qdisc.
Fix this by setting dev->qdisc to the new qdisc before calling
ops->change() in attach_default_qdiscs()
Add a WARN_ON_ONCE() to catch any future similar problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 14 Dec 2013 06:11:22 +0000 (01:11 -0500)]
Merge branch 'for-davem' of git://git./linux/kernel/git/bwh/sfc-next
Ben Hutchings says:
====================
An assortment of changes for Linux 3.14:
1. Merge the sfc fixes that you have already merged into net.git.
(The branch point for those was such that this does not bring in any
other changes.)
2. Reduce log level for a generally useless warning message, from
Robert Stonehouse.
3. Include BISTs in ethtool offline self-test for EF10 and recover from
BISTs initiated through other functions, from Jon Cooper.
4. Improve a sanity check on RX completions.
5. Avoid incrementing RX dropped count while the interface is down, from
Jon Cooper.
6. Improve hardware sensor naming and log messages, from Edward Cree.
7. Log all unexpected errors returned by firmware, from Edward Cree.
8. Expose another NVRAM partition to userland.
9. Some refactoring of the PTP code in preparation for EF10 support.
10. Various minor cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 14 Dec 2013 06:08:14 +0000 (01:08 -0500)]
Merge branch 'bonding_netlink'
Scott Feldman says:
====================
bonding: add more netlink attributes
v2:
Addressed v1 review comments. In particular, Jay's concern about
current sysfs ordering limitations carrying over to iproute. Netlink
attributes are processed in a priority order in
bond_netlink.c:bond_changelink(). Lower priority attributes can't undo
higher priority attributes when attempting to set both with iproute
command. For example, this command will fail:
ip link add bond1 type bond mode active-backup miimon 10 arp_interval 10
Because we're trying to create a new bond to use incompatible miimon
and ARP interval attributes. However, if attributes are applied
one-at-a-time, previously applied attributes can be overridden:
ip link add bond1 type bond mode active-backup miimon 10
ip link set dev bond1 type bond arp_interval 10
These two commands succeed. The bond is first created to use miimon.
Next, the bond is converted to use ARP interval, which undoes miimon.
v1:
Following Jiri Pirko's lead, add more bonding netlink attributes. Sending
matching iproute2 patch separately. sysfs access to attributes is
retained.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Thu, 12 Dec 2013 22:10:45 +0000 (14:10 -0800)]
bonding: add arp_all_targets netlink support
Add IFLA_BOND_ARP_ALL_TARGETS to allow get/set of bonding parameter
arp_all_targets via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Thu, 12 Dec 2013 22:10:38 +0000 (14:10 -0800)]
bonding: add arp_validate netlink support
Add IFLA_BOND_ARP_VALIDATE to allow get/set of bonding parameter
arp_validate via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
sfeldma@cumulusnetworks.com [Thu, 12 Dec 2013 22:10:31 +0000 (14:10 -0800)]
bonding: add arp_ip_target netlink support
Add IFLA_BOND_ARP_IP_TARGET to allow get/set of bonding parameter
arp_ip_target via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>