GitHub/moto-9609/android_kernel_motorola_exynos9610.git
12 years ago ixgbe: Update link flow control to correctly handle multiple packet buffer DCB
Alexander Duyck [Thu, 19 Apr 2012 17:48:48 +0000 (17:48 +0000)]
ixgbe: Update link flow control to correctly handle multiple packet buffer DCB

This change updates the link flow control configuration so that we
correctly set the link flow control settings for DCB.  Previously we would
have to call the fc_enable call 8 times, once for each packet buffer.  If
we move that logic into the fc_enable call itself we can avoid multiple
unnecessary register writes.

This change also corrects an issue in which we were only shifting the water
marks for 82599 parts by 6 instead of 10.  This was resulting in us only
using 1/16 of the packet buffer when flow control was enabled.
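
The 1/16 figure follows directly from the two shift amounts. A minimal sketch of the arithmetic, assuming the water marks are expressed in KB and the shift converts them to the register's units; the helper names are hypothetical and are not taken from the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale a water mark by left-shifting it. */
static uint32_t water_mark_scaled(uint32_t mark_kb, unsigned int shift)
{
    assert(shift < 32);
    return mark_kb << shift;
}

/* Ratio between the corrected result (shift 10) and the old one (shift 6). */
static uint32_t shift_ratio(uint32_t mark_kb)
{
    assert(mark_kb != 0);
    return water_mark_scaled(mark_kb, 10) / water_mark_scaled(mark_kb, 6);
}
```

Shifting by 6 instead of 10 loses a factor of 2^(10 - 6) = 16, which matches the 1/16 quoted above.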

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Reorder link flow control functions in ixgbe_common.c
Alexander Duyck [Thu, 19 Apr 2012 17:49:56 +0000 (17:49 +0000)]
ixgbe: Reorder link flow control functions in ixgbe_common.c

We can avoid many of the forward declarations found in ixgbe_common.c by
just reordering things, so this patch does that to help clean up the code.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Use __free_pages instead of put_page to release pages
Alexander Duyck [Fri, 6 Apr 2012 04:24:50 +0000 (04:24 +0000)]
ixgbe: Use __free_pages instead of put_page to release pages

This change replaces the calls to put_page with calls to __free_pages.

Since the FCoE code is able to access order 1 pages I thought it would be a
good idea to change things over to using __free_pages since that is the
preferred approach for freeing pages.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Make ixgbe_fc_autoneg return void and always set current_mode
Alexander Duyck [Wed, 28 Mar 2012 08:03:48 +0000 (08:03 +0000)]
ixgbe: Make ixgbe_fc_autoneg return void and always set current_mode

This change makes it so that ixgbe_fc_autoneg is a void and always sets the
current_mode.  Previously if the link was down we would return an error,
however there is no harm in simply treating a link down case as a case in
which autoneg simply failed.  This allows us to rely on the return value of
the ixgbe_fc_enable call now since there should be no cases where it
returns an error that would normally be ignored.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Reorder the ring to q_vector mapping to improve performance
Alexander Duyck [Wed, 28 Mar 2012 08:03:43 +0000 (08:03 +0000)]
ixgbe: Reorder the ring to q_vector mapping to improve performance

This change reorders the mapping of rings to q_vectors in the case that the
number of rings exceeds the number of q_vectors.  Previously we would
allocate the first R/N queues to the first q_vector where R is the number
of rings and N is the number of q_vectors.  Instead of doing this we can do
a better job of interleaving the rings to the CPUs by assigning every Nth
ring to the q_vector.

The below tables illustrate this change for the R = 16 N = 4 case.
          Before patch  After patch
q_vector:  0  1  2  3    0  1  2  3
Rings:     0  4  8 12    0  1  2  3
           1  5  9 13    4  5  6  7
           2  6 10 14    8  9 10 11
           3  7 11 15   12 13 14 15

This should improve the performance for both DCB and ATR when the number of
rings exceeds the number of q_vectors allocated by the adapter.
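
The two mappings in the table can be written as simple index arithmetic. This is a sketch under the assumption that R is an exact multiple of N; the function names are hypothetical and are not the driver's own:

```c
#include <assert.h>

/* Old scheme: each q_vector gets a contiguous block of R/N rings. */
static int qvector_before(int ring, int num_rings, int num_vectors)
{
    assert(num_vectors > 0 && num_rings >= num_vectors);
    assert(num_rings % num_vectors == 0);
    return ring / (num_rings / num_vectors);
}

/* New scheme: every Nth ring lands on the same q_vector. */
static int qvector_after(int ring, int num_vectors)
{
    assert(num_vectors > 0);
    return ring % num_vectors;
}
```

For R = 16 and N = 4, ring 13 maps to q_vector 3 under the old scheme and to q_vector 1 under the new one, as in the table.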

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Track instances of buffer available but no DMA resources present
Alexander Duyck [Wed, 28 Mar 2012 08:03:32 +0000 (08:03 +0000)]
ixgbe: Track instances of buffer available but no DMA resources present

This change makes it so that we can track instances where a packet was
dropped because it was received when there were no DMA buffers
available in the ring.

For some reason this was only being enabled with RSC, however it makes
more sense to always have this feature on so that we can track any cases
where we might drop a buffer due to an Rx ring being full.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: initial support for i217
Bruce Allan [Thu, 19 Apr 2012 03:21:47 +0000 (03:21 +0000)]
e1000e: initial support for i217

i217 is the next-generation LOM that will be available on systems with the
Lynx Point Platform Controller Hub (PCH) chipset from Intel.  This patch
provides the initial support for the device.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Update driver version number
Matthew Vick [Wed, 25 Apr 2012 04:45:57 +0000 (04:45 +0000)]
e1000e: Update driver version number

Version bump to 1.11.3-k.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago net/niu: remove one superfluous dma mask check
Sebastian Andrzej Siewior [Thu, 3 May 2012 18:22:00 +0000 (20:22 +0200)]
net/niu: remove one superfluous dma mask check

The idea here seems to be to get a 44bit DMA mask working and if this
fails it should fall back to a 32bit DMA mask. The dma_mask variable is
assigned once to 44bit and never updated. pci_set_dma_mask() and
pci_set_consistent_dma_mask() are both implemented as functions so there
is no evil macro which might update dma_mask. Looking at the assembly, I
see a call to dma_set_mask() followed by dma_supported() and then a jump
past the second dma_set_mask(). The only way to get to the second
dma_set_mask() call is by an error code in the first one.

So I hereby remove the check since it looks superfluous. Please ignore
the patch if there is black magic involved.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net...
David S. Miller [Thu, 3 May 2012 17:30:11 +0000 (13:30 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

12 years ago skb: Add skb_head_is_locked helper function
Alexander Duyck [Thu, 3 May 2012 01:09:42 +0000 (01:09 +0000)]
skb: Add skb_head_is_locked helper function

This patch adds support for a skb_head_is_locked helper function.  It is
meant to be used any time we are considering transferring the head from
skb->head to a paged frag.  If the head is locked it means we cannot remove
the head from the skb so it must be copied or we must take the skb as a
whole.
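
The commit message does not spell out the exact condition, so the following is only one plausible reading, informed by the neighbouring "Stop decapitating clones that have a head_frag" commit: the head is treated as locked when it is not a page fragment or when the skb is cloned. The struct and function names are toy stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the fields of struct sk_buff that matter here. */
struct toy_skb {
    bool head_frag; /* head was allocated as a page fragment */
    bool cloned;    /* another skb shares this head */
};

/* Assumed rule: only an uncloned page-fragment head may be stolen. */
static bool toy_head_is_locked(const struct toy_skb *skb)
{
    assert(skb != NULL);
    return !skb->head_frag || skb->cloned;
}
```

A caller would copy the head (or take the skb as a whole) whenever this returns true, and could turn the head into a paged frag otherwise.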

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: Fix truesize accounting in skb_gro_receive()
Eric Dumazet [Wed, 2 May 2012 23:33:21 +0000 (23:33 +0000)]
net: Fix truesize accounting in skb_gro_receive()

GRO is very optimistic in skb truesize estimates, only taking into
account the used part of fragments.

Be conservative, and use more precise computation, so that bloated GRO
skbs can be collapsed eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago ixgbevf: Update version string
Greg Rose [Tue, 17 Apr 2012 04:29:39 +0000 (04:29 +0000)]
ixgbevf: Update version string

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbevf: Make sure jumbo frames are set correctly after PF reset
Greg Rose [Tue, 17 Apr 2012 04:29:34 +0000 (04:29 +0000)]
ixgbevf: Make sure jumbo frames are set correctly after PF reset

If the Physical Function (PF) resets after the VF has set jumbo
frame MTU then the VF jumbo frame is overwritten.  Make sure the
VF driver always requests proper MTU size after reset
synchronization.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbevf: Add support to recognize 100mb link speed
Greg Rose [Tue, 10 Apr 2012 01:56:37 +0000 (01:56 +0000)]
ixgbevf: Add support to recognize 100mb link speed

The X540 10Gig controller is capable of linking at 100Mbits - add
support for reporting that link speed.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Remove special case for 82573/82574 ASPM L1 disablement
Chris Boot [Tue, 24 Apr 2012 07:24:58 +0000 (07:24 +0000)]
e1000e: Remove special case for 82573/82574 ASPM L1 disablement

For the 82573, ASPM L1 gets disabled wholesale so this special-case code
is not required. For the 82574 the previous patch does the same as for
the 82573, disabling L1 on the adapter. Thus, this code is no longer
required and can be removed.

Signed-off-by: Chris Boot <bootc@bootc.net>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Disable ASPM L1 on 82574
Chris Boot [Tue, 24 Apr 2012 07:24:52 +0000 (07:24 +0000)]
e1000e: Disable ASPM L1 on 82574

ASPM on the 82574 causes trouble. Currently the driver disables L0s for
this NIC but only disables L1 if the MTU is >1500. This patch simply
causes L1 to be disabled regardless of the MTU setting.

Signed-off-by: Chris Boot <bootc@bootc.net>
Cc: "Wyborny, Carolyn" <carolyn.wyborny@intel.com>
Cc: Nix <nix@esperi.org.uk>
Link: https://lkml.org/lkml/2012/3/19/362
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Driver workaround for IPv6 Header Extension Erratum.
Matthew Vick [Wed, 25 Apr 2012 08:01:05 +0000 (08:01 +0000)]
e1000e: Driver workaround for IPv6 Header Extension Erratum.

Previously, IPv6 extension header parsing was disabled for all devices
supported by e1000e when using packet split mode. However, as per a
silicon erratum, only certain devices need this restriction and will need
to disable IPv6 extension header parsing for all modes.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Resolve intermittent negotiation issue on 82574/82583.
Matthew Vick [Wed, 25 Apr 2012 07:25:18 +0000 (07:25 +0000)]
e1000e: Resolve intermittent negotiation issue on 82574/82583.

For 82574 and 82583 devices, resolve an intermittent link issue where
the link negotiates to 100Mbps rather than 1Gbps when powering off the
PHY and powering on the PHY after several seconds.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: cleanup long [read|write]_reg_locked PHY ops function pointers
Bruce Allan [Sat, 14 Apr 2012 04:21:52 +0000 (04:21 +0000)]
e1000e: cleanup long [read|write]_reg_locked PHY ops function pointers

Calling the locked versions of the read/write PHY ops function pointers
often produces excessively long lines.  Shorten these as is done with
the non-locked versions of the PHY register read/write functions.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: suggest a possible workaround to a device hang on 82577/8
Bruce Allan [Tue, 20 Mar 2012 03:48:08 +0000 (03:48 +0000)]
e1000e: suggest a possible workaround to a device hang on 82577/8

There is a known issue in the 82577 and 82578 device that can cause a hang
in the device hardware during traffic stress; the current workaround in the
driver is to disable transmit flow control by default.  If the user enables
transmit flow control and the device hang occurs, provide a message in the
syslog suggesting to re-enable the workaround.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Fix use after free on module remove
Alexander Duyck [Wed, 2 May 2012 21:19:14 +0000 (21:19 +0000)]
ixgbe: Fix use after free on module remove

While testing the TCP changes I had to fix an issue in order to be able to
load and unload the module.

The recent patch that added thermal sensor support added a use after free
bug on module unload with an 82598 adapter in the system.  To resolve the
issue I have updated the code so that when we free the info_kobj we set it
back to NULL.
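
The fix follows the usual "release, then clear the pointer" pattern. A userspace sketch of that pattern, with free() standing in for whatever release call the driver actually uses on info_kobj; the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

struct toy_adapter {
    int *info; /* stands in for info_kobj */
};

/* Release the object and clear the pointer so later paths cannot reuse it. */
static void toy_release_info(struct toy_adapter *adapter)
{
    assert(adapter != NULL);
    free(adapter->info);  /* free(NULL) is a no-op, so a second call is safe */
    adapter->info = NULL; /* any later check sees "already released" */
}
```

With the pointer cleared, a second teardown pass sees NULL instead of a dangling pointer.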

I suspect there are likely other bugs present, but I will leave that for
another patch that can undergo more testing.

I am submitting this directly to net-next since this fixes a fairly serious
bug that will lock up the ixgbe module until the system is rebooted.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: move stats merge to the end of tcp_try_coalesce
Alexander Duyck [Wed, 2 May 2012 21:19:09 +0000 (21:19 +0000)]
tcp: move stats merge to the end of tcp_try_coalesce

This change cleans up the last bits of tcp_try_coalesce so that we only
need one goto which jumps to the end of the function.  The idea is to make
the code more readable by putting things in a linear order so that we start
execution at the top of the function, and end it at the bottom.

I also made a slight tweak to the code for handling frags when we are a
clone.  Instead of making it an if (clone) loop else nr_frags = 0 I changed
the logic so that if (!clone) we just set the number of frags to 0 which
disables the for loop anyway.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: Move code related to head frag in tcp_try_coalesce
Alexander Duyck [Wed, 2 May 2012 21:19:04 +0000 (21:19 +0000)]
tcp: Move code related to head frag in tcp_try_coalesce

This change reorders the code related to the use of an skb->head_frag so it
is placed before we check the rest of the frags.  This allows the code to
read more linearly instead of like some sort of loop.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: Fix truesize accounting in tcp_try_coalesce
Alexander Duyck [Wed, 2 May 2012 21:18:59 +0000 (21:18 +0000)]
tcp: Fix truesize accounting in tcp_try_coalesce

This patch addresses several issues in the way we were tracking the
truesize in tcp_try_coalesce.

First it was using ksize which prevents us from having a 0 sized head frag
and getting a usable result.  To resolve that this patch uses the end
pointer which is set based off either ksize, or the frag_size supplied in
build_skb.  This allows us to compute the original truesize of the entire
buffer and remove that value leaving us with just what was added as pages.

The second issue was the use of skb->len if there is a mergeable head frag.
We should only need to remove the size of a data-aligned sk_buff from our
current skb->truesize to compute the delta for a buffer with a reused head.
By using skb->len the value of truesize was being artificially reduced
which means that head frags could use more memory than buffers using
standard allocations.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: Add missing linux/prefetch.h include to net/core/sock.c
David S. Miller [Thu, 3 May 2012 06:25:55 +0000 (02:25 -0400)]
net: Add missing linux/prefetch.h include to net/core/sock.c

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: Stop decapitating clones that have a head_frag
Alexander Duyck [Wed, 2 May 2012 18:18:42 +0000 (18:18 +0000)]
net: Stop decapitating clones that have a head_frag

This change is meant to prevent stealing the skb->head to use as a page in
the event that the skb->head was cloned.  This allows the other clones to
track each other via shinfo->dataref.

Without this we break down to two methods for tracking the reference count,
one being dataref, the other being the page count.  As a result it becomes
difficult to track how many references there are to skb->head.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: implement tcp coalescing in tcp_queue_rcv()
Eric Dumazet [Wed, 2 May 2012 09:58:29 +0000 (09:58 +0000)]
net: implement tcp coalescing in tcp_queue_rcv()

Extend tcp coalescing by also performing it from tcp_queue_rcv(), the main
receive function when the application is not blocked in recvmsg().

Function tcp_queue_rcv() is moved a bit to allow its call from
tcp_data_queue()

This gives good results especially if GRO could not kick in, and if skb
head is a fragment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: take care of cloned skbs in tcp_try_coalesce()
Eric Dumazet [Wed, 2 May 2012 07:55:58 +0000 (07:55 +0000)]
net: take care of cloned skbs in tcp_try_coalesce()

Before stealing fragments or skb head, we must make sure skbs are not
cloned.

Alexander was worried about the destination skb being cloned: in bridge
setups, a driver could be fooled if skb->data_len would not match skb
nr_frags.

If source skb is cloned, we must take references on pages instead.

Bug happened using tcpdump (if not using mmap())

Introduce kfree_skb_partial() helper to cleanup code.

Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago be2net: Fix EEH error reset before a flash dump completes
Somnath Kotur [Wed, 2 May 2012 03:41:01 +0000 (03:41 +0000)]
be2net: Fix EEH error reset before a flash dump completes

An EEH error can cause the FW to trigger a flash debug dump.
Resetting the card while flash dump is in progress can cause it not to recover.
Wait for it to finish before letting EEH flow to reset the card.

Signed-off-by: Sathya Perla <Sathya.Perla@emulex.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago be2net: Record receive queue index in skb to aid RPS.
Somnath Kotur [Wed, 2 May 2012 03:40:49 +0000 (03:40 +0000)]
be2net: Record receive queue index in skb to aid RPS.

Signed-off-by: Sarveshwar Bandi <Sarveshwar.Bandi@emulex.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago be2net: Fix to apply duplex value as unknown when link is down.
Somnath Kotur [Wed, 2 May 2012 03:40:32 +0000 (03:40 +0000)]
be2net: Fix to apply duplex value as unknown when link is down.

Suggested-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Sarveshwar Bandi <sarveshwar.bandi@emulex.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago be2net: Fix to not set link speed for disabled functions of a UMC card
Somnath Kotur [Wed, 2 May 2012 03:40:16 +0000 (03:40 +0000)]
be2net: Fix to not set link speed for disabled functions of a UMC card

This renders the interface view somewhat inconsistent from the Host OS POV
considering the rest of the interfaces are showing their respective speeds
based on the bandwidth assigned to them.

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: early retransmit: delayed fast retransmit
Yuchung Cheng [Wed, 2 May 2012 13:30:04 +0000 (13:30 +0000)]
tcp: early retransmit: delayed fast retransmit

Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.
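
The timer arithmetic described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the time unit is left abstract, and clamping the restored RTO at zero is an assumption rather than something the commit states:

```c
#include <assert.h>
#include <stdint.h>

/* Delay applied before the fast retransmit: a quarter of the RTT. */
static uint32_t er_delay(uint32_t rtt)
{
    return rtt / 4;
}

/* RTO to re-arm when the delayed-ER timer is cancelled: the original
 * value offset by the time already elapsed (clamped at zero here). */
static uint32_t rto_remaining(uint32_t rto, uint32_t elapsed)
{
    assert(rto > 0);
    return elapsed >= rto ? 0 : rto - elapsed;
}
```

For example, with an RTT of 200 time units the retransmit is delayed by 50, and cancelling after 50 units of a 1000-unit RTO leaves 950.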

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: early retransmit
Yuchung Cheng [Wed, 2 May 2012 13:30:03 +0000 (13:30 +0000)]
tcp: early retransmit

This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are
less than 4 to recover losses by fast recovery instead of timeout.

While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enters
false recovery and degrades performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. Currently
ER is conservative and is disabled for the rest of the connection
after the first reordering event. A large scale web server
experiment on the performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP",
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

Note that Linux has a similar feature called THIN_DUPACK. The
differences are that THIN_DUPACK does not mitigate reorderings and is only
used after slow start. Currently ER is disabled if THIN_DUPACK is
enabled. I would be happy to merge THIN_DUPACK feature with ER if
people think it's a good idea.

ER is enabled by sysctl_tcp_early_retrans:
  0: Disables ER

  1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

  2: (Default) reduce dupthresh like mode 1. In addition, delay
     entering fast recovery by RTT/4.

Note: mode 2 is implemented in the third part of this patch series.
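
The threshold rule for modes 1 and 2 can be sketched as a small function. The default of 3 duplicate ACKs and the fallback to that default when fewer than two packets are outstanding are assumptions made for this illustration, and the names are hypothetical:

```c
#include <assert.h>

#define DEFAULT_DUPTHRESH 3 /* assumed classic fast-retransmit threshold */

/* Returns the DUPACK threshold to use for the given sysctl mode. */
static int er_dupthresh(int packets_out, int early_retrans_mode)
{
    assert(packets_out >= 0);
    if (early_retrans_mode == 0)
        return DEFAULT_DUPTHRESH;   /* ER disabled */
    if (packets_out >= 2 && packets_out < 4)
        return packets_out - 1;     /* modes 1 and 2 reduce the threshold */
    return DEFAULT_DUPTHRESH;
}
```

Mode 2 uses the same threshold as mode 1; the RTT/4 delay it adds is a separate timer step and is not modelled here.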

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: early retransmit: tcp_enter_recovery()
Yuchung Cheng [Wed, 2 May 2012 13:30:02 +0000 (13:30 +0000)]
tcp: early retransmit: tcp_enter_recovery()

This is a preparation patch that refactors the code to enter recovery
into a new function tcp_enter_recovery(). It's needed to implement
the delayed fast retransmit in ER.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net/pasemi: fix compiler warning
Stephen Rothwell [Thu, 3 May 2012 00:51:46 +0000 (10:51 +1000)]
net/pasemi: fix compiler warning

Fix this compiler warning (on PowerPC) by not marking a parameter as
const:

drivers/net/ethernet/pasemi/pasemi_mac.c: In function 'pasemi_mac_replenish_rx_ring':
drivers/net/ethernet/pasemi/pasemi_mac.c:646:3: warning: passing argument 1 of 'netdev_alloc_skb' discards qualifiers from pointer target type
include/linux/skbuff.h:1706:31: note: expected 'struct net_device *' but argument is of type 'const struct net_device *'

Cc: Olof Johansson <olof@lixom.net>
Cc: Pradeep A. Dalvi <netdev@pradeepdalvi.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago bnx2x: fix handling single MSIX mode for 57710/57711
Dmitry Kravkov [Wed, 2 May 2012 01:16:33 +0000 (01:16 +0000)]
bnx2x: fix handling single MSIX mode for 57710/57711

commit 30a5de7723a8a4211be02e94236e9167a424fd07 added the
ability to use a single MSI-X vector, but lacks proper
handling for 57710/57711 HW

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago ixgbe: Reset max_vfs to zero when user request is out of range
Greg Rose [Tue, 17 Apr 2012 04:29:29 +0000 (04:29 +0000)]
ixgbe: Reset max_vfs to zero when user request is out of range

If the user request for the number of VFs in the max_vfs parameter is
out of range then reset the value to the default value of zero.  This
makes the behavior of the ixgbe driver the same as for the igb driver.
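
A minimal sketch of the parameter check, assuming the upper bound is passed in by the caller; the commit message does not state the limit, so the helper name and the `limit` parameter are hypothetical:

```c
#include <assert.h>

/* Out-of-range requests fall back to the default of zero VFs. */
static unsigned int sanitize_max_vfs(unsigned int requested, unsigned int limit)
{
    assert(limit > 0);
    return requested > limit ? 0 : requested;
}
```

An in-range request is returned unchanged, and zero VFs (the default) is always accepted.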

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Robert Garrett <robertx.e.garrett@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: Deny MACVLAN requests from VFs with admin set MAC
Greg Rose [Sat, 24 Mar 2012 00:26:44 +0000 (00:26 +0000)]
ixgbe: Deny MACVLAN requests from VFs with admin set MAC

If the host VMM administrator has set the virtual function device's
MAC address then also deny VF requests for MACVLAN filters.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Garrett, Robert <robertx.e.garrett@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: add hwmon interface to export thermal data
Don Skidmore [Thu, 12 Apr 2012 00:33:31 +0000 (00:33 +0000)]
ixgbe: add hwmon interface to export thermal data

Some of our adapters have thermal data available; this patch exports
this data via hwmon sysfs interface.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago ixgbe: add support functions to access thermal data
Don Skidmore [Fri, 17 Feb 2012 02:38:58 +0000 (02:38 +0000)]
ixgbe: add support functions to access thermal data

Some 82599 adapters contain thermal data that we can get to via
an i2c interface.  These functions provide support to get at that
data.  A following patch will export this data.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: fix .ndo_set_rx_mode for 82579
Bruce Allan [Sat, 14 Apr 2012 03:28:50 +0000 (03:28 +0000)]
e1000e: fix .ndo_set_rx_mode for 82579

Secondary unicast and multicast addresses are added to the Receive
Address registers (RAR) for most parts supported by the driver.  For
82579, there is only one actual RAR and a number of Shared Receive Address
registers (SHRAR) that are shared among the driver and f/w which can be
reserved and write-protected by the f/w.  On this device, use the SHRARs
that are not taken by f/w for the additional addresses.

Add a MAC ops function pointer infrastructure (similar to other MAC
operations in the driver) for setting RARs, introduce a new rar_set
function for 82579 and convert the existing code that sets RARs on other
devices to a generic rar_set function.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: PHY initialization flow changes for 82577/8/9
Bruce Allan [Fri, 13 Apr 2012 03:16:22 +0000 (03:16 +0000)]
e1000e: PHY initialization flow changes for 82577/8/9

The PHY initialization flows and assorted workarounds for 82577/8/9 done
during driver load and resume from Sx should be the same yet they are not.
Combine the current flows/workarounds into a common set of functions that
are called during the different code paths.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: workaround EEPROM configuration change on 82579
Bruce Allan [Tue, 20 Mar 2012 03:47:57 +0000 (03:47 +0000)]
e1000e: workaround EEPROM configuration change on 82579

An update to the EEPROM on 82579 will extend a delay in hardware to fix an
issue with WoL not working after a G3->S5 transition which is unrelated to
the driver.  However, this extended delay conflicts with nominal operation
of the device when it is initialized by the driver and after every reset
of the hardware (i.e. the driver starts configuring the device before the
hardware is done with its own configuration work).  The workaround for
when the driver is in control of the device is to tell the hardware after
every reset the configuration delay should be the original shorter one.

Some pre-existing variables are renamed generically to be re-used with
new register accesses.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago netem: add ECN capability
Eric Dumazet [Mon, 30 Apr 2012 23:11:05 +0000 (23:11 +0000)]
netem: add ECN capability

Add ECN (Explicit Congestion Notification) marking capability to netem

tc qdisc add dev eth0 root netem drop 0.5 ecn

Instead of dropping packets, try to ECN mark them.
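
The "mark instead of drop" decision can be sketched on the ECN field alone. This is a simplified illustration and not netem's code: it works on a bare TOS / traffic-class byte, the names are hypothetical, and a real implementation must also fix up the IPv4 header checksum after changing the byte:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ECN_MASK    0x03 /* low two bits of the TOS / traffic class byte */
#define ECN_NOT_ECT 0x00 /* sender is not ECN-capable */
#define ECN_CE      0x03 /* congestion experienced */

/* Returns true if the packet was CE-marked, false if it must be dropped. */
static bool mark_instead_of_drop(uint8_t *tos)
{
    assert(tos != NULL);
    if ((*tos & ECN_MASK) == ECN_NOT_ECT)
        return false;   /* cannot mark a non-ECN packet */
    *tos |= ECN_CE;     /* ECT(0) or ECT(1) becomes CE */
    return true;
}
```

Packets from senders that did not negotiate ECN still have to be dropped, which is why the message says "try to".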

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: skb_peek()/skb_peek_tail() cleanups
Eric Dumazet [Mon, 30 Apr 2012 16:31:46 +0000 (16:31 +0000)]
net: skb_peek()/skb_peek_tail() cleanups

remove useless casts and rename variables for less confusion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: add a prefetch in socket backlog processing
Eric Dumazet [Mon, 30 Apr 2012 16:07:09 +0000 (16:07 +0000)]
net: add a prefetch in socket backlog processing

TCP or UDP stacks have big enough latencies that prefetching next
pointer is worth it.
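
The idea can be sketched in userspace with a linked list and the GCC/Clang `__builtin_prefetch` builtin in place of the kernel's prefetch(). The list type and function are toy stand-ins, and the per-packet "work" is reduced to summing lengths:

```c
#include <assert.h>
#include <stddef.h>

struct toy_pkt {
    struct toy_pkt *next;
    int len;
};

/* Walk the backlog, prefetching the next entry before handling this one. */
static int process_backlog(struct toy_pkt *head)
{
    int total = 0;
    struct toy_pkt *pkt = head;

    while (pkt != NULL) {
        struct toy_pkt *next = pkt->next;

        __builtin_prefetch(next); /* hint only; harmless when next is NULL */
        assert(pkt->len >= 0);    /* toy invariant */
        total += pkt->len;        /* stand-in for per-packet protocol work */
        pkt = next;
    }
    return total;
}
```

The prefetch only pays off when the work done per packet is long enough to hide the memory latency of the next entry, which is the point the commit message makes about the TCP and UDP stacks.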

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: let iproute2 create L2TPv3 IP tunnels using IPv6
James Chapman [Sun, 29 Apr 2012 21:48:55 +0000 (21:48 +0000)]
l2tp: let iproute2 create L2TPv3 IP tunnels using IPv6

The netlink API lets users create unmanaged L2TPv3 tunnels using
iproute2. Until now, a request to create an unmanaged L2TPv3 IP
encapsulation tunnel over IPv6 would be rejected with
EPROTONOSUPPORT. Now that l2tp_ip6 implements sockets for L2TP IP
encapsulation over IPv6, we can add support for that tunnel type.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: introduce L2TPv3 IP encapsulation support for IPv6
Chris Elston [Sun, 29 Apr 2012 21:48:54 +0000 (21:48 +0000)]
l2tp: introduce L2TPv3 IP encapsulation support for IPv6

L2TPv3 defines an IP encapsulation packet format where data is carried
directly over IP (no UDP). The kernel already has support for L2TP IP
encapsulation over IPv4 (l2tp_ip). This patch introduces support for
L2TP IP encapsulation over IPv6.

The implementation is derived from ipv6/raw and ipv4/l2tp_ip.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago ipv6: Export ipv6 functions for use by other protocols
Chris Elston [Sun, 29 Apr 2012 21:48:53 +0000 (21:48 +0000)]
ipv6: Export ipv6 functions for use by other protocols

For implementing other protocols on top of IPv6, such as L2TPv3's IP
encapsulation over ipv6, we'd like to call some IPv6 functions which
are not currently exported. This patch exports them.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: netlink api for l2tpv3 ipv6 unmanaged tunnels
Chris Elston [Sun, 29 Apr 2012 21:48:52 +0000 (21:48 +0000)]
l2tp: netlink api for l2tpv3 ipv6 unmanaged tunnels

This patch adds support for unmanaged L2TPv3 tunnels over IPv6 using
the netlink API. We already support unmanaged L2TPv3 tunnels over
IPv4. A patch to iproute2 to make use of this feature will be
submitted separately.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: show IPv6 addresses in l2tp debugfs file
Chris Elston [Sun, 29 Apr 2012 21:48:51 +0000 (21:48 +0000)]
l2tp: show IPv6 addresses in l2tp debugfs file

If an L2TP tunnel uses IPv6, make sure the l2tp debugfs file shows the
IPv6 address correctly.

Signed-off-by: Chris Elston <celston@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: pppol2tp_connect() handles ipv6 sockaddr variants
James Chapman [Sun, 29 Apr 2012 21:48:50 +0000 (21:48 +0000)]
l2tp: pppol2tp_connect() handles ipv6 sockaddr variants

Userspace uses connect() to associate a pppol2tp socket with a tunnel
socket. This needs to allow the caller to supply the new IPv6
sockaddr_pppol2tp structures if IPv6 is used.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago pppox: Replace __attribute__((packed)) in if_pppox.h
James Chapman [Sun, 29 Apr 2012 21:48:49 +0000 (21:48 +0000)]
pppox: Replace __attribute__((packed)) in if_pppox.h

Checkpatch warns about the use of __attribute__((packed)). So use the
recommended __packed syntax instead.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: remove unused stats from l2tp_ip socket
James Chapman [Sun, 29 Apr 2012 21:48:48 +0000 (21:48 +0000)]
l2tp: remove unused stats from l2tp_ip socket

The l2tp_ip socket currently maintains packet/byte stats in its
private socket structure. But these counters aren't exposed to
userspace and so serve no purpose. The counters were also
smp-unsafe. So this patch just gets rid of the stats.

While here, change a couple of internal __u32 variables to u32.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: Use ip4_datagram_connect() in l2tp_ip_connect()
James Chapman [Sun, 29 Apr 2012 21:48:47 +0000 (21:48 +0000)]
l2tp: Use ip4_datagram_connect() in l2tp_ip_connect()

Cleanup the l2tp_ip code to make use of an existing ipv4 support function.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: fix locking of 64-bit counters for smp
James Chapman [Sun, 29 Apr 2012 21:48:46 +0000 (21:48 +0000)]
l2tp: fix locking of 64-bit counters for smp

L2TP uses 64-bit counters but since these are not updated atomically,
we need to make them safe for smp. This patch addresses that.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: remove PHY polling from atl1c_change_mtu
Huang, Xiong [Mon, 30 Apr 2012 15:38:58 +0000 (15:38 +0000)]
atl1c: remove PHY polling from atl1c_change_mtu

PHY polling for FPGA is already handled in every MDIO read/write API,
so there is no need for additional code in atl1c_change_mtu.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: David Liu <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: Disable L0S when no cable link
Huang, Xiong [Mon, 30 Apr 2012 15:38:57 +0000 (15:38 +0000)]
atl1c: Disable L0S when no cable link

L0S might be unstable when there is no cable link, so only enable it when the link is up.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: do MAC-reset when PHY link down
Huang, Xiong [Mon, 30 Apr 2012 15:38:56 +0000 (15:38 +0000)]
atl1c: do MAC-reset when PHY link down

There may be tx skbs still pending in hardware when the PHY link goes down.
Resetting the MAC returns the DMA engine to its start point
and releases all pending skbs.
Note: resetting the MAC clears any interrupt status and mask.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: cancel task when interface closed
Huang, Xiong [Mon, 30 Apr 2012 15:38:55 +0000 (15:38 +0000)]
atl1c: cancel task when interface closed

common_task might be running while the close routine is called,
so wait for it or cancel it.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: enlarge L1 response waiting timer
Huang, Xiong [Mon, 30 Apr 2012 15:38:54 +0000 (15:38 +0000)]
atl1c: enlarge L1 response waiting timer

The hardware incorrectly processes L0S/L1 entry if the chipset/root
responds after a specific, shorter timer, which causes a system hang.
Enlarge the timeout value to avoid this issue.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: refine mac address related code
Huang, Xiong [Mon, 30 Apr 2012 15:38:53 +0000 (15:38 +0000)]
atl1c: refine mac address related code

On some platforms with an EEPROM/OTP, the BIOS can overwrite the NIC's
MAC address with a new one, so the permanent MAC address should be the
one from the BIOS. The address is restored when the driver is removed.
Voltage raising isn't applicable for l1d.
Replace swab32 with htonl so the code works on both big- and
little-endian platforms.
Related registers are refined as well.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
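
The swab32 to htonl replacement in the commit above matters because the two only agree on little-endian hosts. The following is a small user-space model, not driver code: the `host_order` parameter is a hypothetical stand-in for the CPU's byte order, and how the driver actually uses the converted value is not shown in the commit message.

```python
def swab32(x):
    """Unconditionally reverse the four bytes of a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


def htonl(x, host_order):
    """Model of htonl on a host with the given byte order ("little" or "big").

    The result is the integer whose in-memory bytes, on that host, are the
    big-endian (network order) encoding of x.
    """
    return int.from_bytes(x.to_bytes(4, "big"), host_order)


value = 0x11223344
same_on_little = htonl(value, "little") == swab32(value)   # both swap
differs_on_big = htonl(value, "big") != swab32(value)      # htonl is a no-op
```

On a big-endian host an unconditional swab32 would scramble a value that is already in network order, which is the case htonl handles correctly.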
12 years ago atl1c: remove code of closing register writable attribution
Huang, Xiong [Mon, 30 Apr 2012 15:38:52 +0000 (15:38 +0000)]
atl1c: remove code of closing register writable attribution

The close action is done by atl1c_reset_pcie, so remove it from
atl1c_get_permanent_address.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: clear WoL status when reset pcie
Huang, Xiong [Mon, 30 Apr 2012 15:38:51 +0000 (15:38 +0000)]
atl1c: clear WoL status when reset pcie

WoL status is read-clear and should be cleared in the S0 state.
Clearing it in atl1c_reset_pcie is more suitable than doing so in
atl1c_get_permanent_address.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: add PHY link event(up/down) patch
Huang, Xiong [Mon, 30 Apr 2012 15:38:50 +0000 (15:38 +0000)]
atl1c: add PHY link event(up/down) patch

On some platforms the PHY settings need to change depending on the
cable link status to get better stability.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Tested-by: Liu David <dwliu@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago atl1c: add workaround for issue of bit INTX-disable for MSI interrupt
Huang, Xiong [Mon, 30 Apr 2012 15:38:49 +0000 (15:38 +0000)]
atl1c: add workaround for issue of bit INTX-disable for MSI interrupt

All supported devices share one issue: the MSI interrupt doesn't assert
if the PCI command register bit PCI_COMMAND_INTX_DISABLE is set.
Add a workaround in drivers/pci/quirks.c.

Signed-off-by: xiong <xiong@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago Merge branch 'tipc_net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg...
David S. Miller [Tue, 1 May 2012 01:42:30 +0000 (21:42 -0400)]
Merge branch 'tipc_net-next' of git://git./linux/kernel/git/paulg/linux

12 years ago bnx2x: remove some bloat
Eric Dumazet [Fri, 27 Apr 2012 21:39:21 +0000 (21:39 +0000)]
bnx2x: remove some bloat

Before doing skb->head_frag work on bnx2x driver, I found too much stuff
was inlined in bnx2x/bnx2x_cmn.h for no good reason and made my work not
very easy.

Move some big functions out of this include file to the respective .c
file.

A lot of inline keywords are not needed at all in this huge driver.

   text    data     bss     dec     hex filename
 490083    1270      56  491409   77f91 bnx2x/bnx2x.ko.before
 484206    1270      56  485532   7689c bnx2x/bnx2x.ko

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
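
The size table in the bnx2x commit above can be checked with a few lines of arithmetic. The numbers are copied from the commit message; the variable names are only illustrative.

```python
# Section sizes in bytes, as reported by size(1) in the commit message.
before = {"text": 490083, "data": 1270, "bss": 56}
after = {"text": 484206, "data": 1270, "bss": 56}

total_before = sum(before.values())   # the "dec" column for bnx2x.ko.before
total_after = sum(after.values())     # the "dec" column for bnx2x.ko

saved = total_before - total_after
percent = 100.0 * saved / total_before

print(f"{saved} bytes saved ({percent:.2f}% of the module)")
```

Only the text section shrinks, so the whole 5877-byte saving comes from code that is no longer duplicated by inlining.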
12 years ago pch_gbe: reprogram multicast address register on reset
RongQing.Li [Fri, 27 Apr 2012 19:53:41 +0000 (19:53 +0000)]
pch_gbe: reprogram multicast address register on reset

The reset logic after a Rx FIFO overrun will clear the programmed
multicast addresses. This patch fixes the issue by reprogramming the
registers after the reset.

The commit eefc48b ("pch_gbe: reprogram multicast address register on
reset") tried to fix this problem, but it introduced unnecessary
code. In fact, all multicast addresses are already saved in netdev->mc,
so we can call pch_gbe_set_multi() directly after reset_hw and
reset_rx.

This commit removes 50+ lines of code.

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Takahiro Shimizu <tshimizu818@gmail.com>
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: makes skb_splice_bits() aware of skb->head_frag
Eric Dumazet [Fri, 27 Apr 2012 02:10:03 +0000 (02:10 +0000)]
net: makes skb_splice_bits() aware of skb->head_frag

__skb_splice_bits() can check if skb to be spliced has its skb->head
mapped to a page fragment, instead of a kmalloc() area.

If so we can avoid a copy of the skb head and get a reference on
underlying page.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tcp: makes tcp_try_coalesce aware of skb->head_frag
Eric Dumazet [Fri, 27 Apr 2012 00:38:33 +0000 (00:38 +0000)]
tcp: makes tcp_try_coalesce aware of skb->head_frag

TCP coalesce can check if skb to be merged has its skb->head mapped to a
page fragment, instead of a kmalloc() area.

We had to disable coalescing in this case, for performance reasons.

We 'upgrade' skb->head as a fragment in itself.

This reduces the number of cache misses when the user makes its copies,
since fewer sk_buffs are fetched.

This makes the receive and ofo queues shorter and thus reduces cache line
misses in the TCP stack.

This is a followup of patch "net: allow skb->head to be a page fragment"

Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce"
counter being incremented.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: make GRO aware of skb->head_frag
Eric Dumazet [Mon, 30 Apr 2012 08:10:34 +0000 (08:10 +0000)]
net: make GRO aware of skb->head_frag

GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.

We 'upgrade' skb->head as a fragment in itself.

This avoids the frag_list fallback and permits building a true GRO skb
(one sk_buff and up to 16 fragments), using less memory.

This reduces the number of cache misses when the user makes its copy,
since a single sk_buff is fetched.

This is a followup of patch "net: allow skb->head to be a page fragment"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tg3: provide frags as skb head
Eric Dumazet [Fri, 27 Apr 2012 00:34:49 +0000 (00:34 +0000)]
tg3: provide frags as skb head

This patch converts tg3 driver, one of our reference drivers, to use new
build_skb() api in frag mode.

Instead of using kmalloc() to allocate the memory block that will be
used by build_skb() as skb->head, we use a page fragment.

This is a followup of patch "net: allow skb->head to be a page fragment"

This allows GRO, TCP coalescing, and splice() to be more efficient.

Incidentally, this also removes SLUB slow path contention in kfree().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: allow skb->head to be a page fragment
Eric Dumazet [Fri, 27 Apr 2012 00:33:38 +0000 (00:33 +0000)]
net: allow skb->head to be a page fragment

skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback that the data cannot be converted to a page fragment if
needed.

We have three spots where it hurts:

1) GRO aggregation

 When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, which is very inefficient since we keep all struct
sk_buff around. So drivers enabling GRO but delivering linear skbs to the
network stack aren't getting full GRO power.

2) splice(socket -> pipe).

 We must copy the linear part to a page fragment.
 This rather defeats the purpose of splice() (its zero-copy claim).

3) TCP coalescing.

 Recently introduced, this permits grouping several contiguous segments
into a single skb. This shortens queue lengths, saves kernel memory,
and greatly reduces the probability of TCP collapses. This coalescing
doesn't work on linear skbs (or we would need to copy data, which would
be too slow).

Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.

build_skb() is changed to accept a frag_size argument. Drivers willing
to provide a page fragment instead of kmalloc() data will set a non zero
value, set to the fragment size.

Then, in situations where we need to convert the skb head to a frag in
itself, we can check if skb->head_frag is set and avoid the copies or
various fallbacks we have.

This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize). (That's 512 or 1024 bytes saved per skb.) This also makes
bpf/netfilter faster since the 'first frag' will be part of the skb linear
part, with no need to copy data.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
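
The head_frag idea described in the commit above can be illustrated with a toy model. This is only a sketch in user-space Python: the `Skb` class, `build_skb()` signature, and `merge()` helper are hypothetical simplifications, and the real kernel structures, reference counting, and size limits are not modelled. It assumes, as the message states, that a non-zero frag_size marks the head as a page fragment.

```python
from dataclasses import dataclass, field


@dataclass
class Skb:
    head: bytes                 # linear part of the packet
    head_frag: bool             # True if head lives in a page fragment
    frags: list = field(default_factory=list)


def build_skb(data, frag_size=0):
    """Hypothetical model: a non-zero frag_size means 'head is a page frag'."""
    return Skb(head=data, head_frag=frag_size != 0)


def merge(dst, src):
    """Try to append src to dst without copying (GRO/coalescing style).

    Returns True if src's head could be attached as a fragment by
    reference, False if the caller must use a slower fallback.
    """
    if not src.head_frag:
        return False            # kmalloc()-style head: cannot become a frag
    dst.frags.append(src.head)  # keep a reference, no copy of the payload
    dst.frags.extend(src.frags)
    return True
```

In this model a driver that passes a fragment size to `build_skb()` gets the zero-copy path, while one that passes 0 keeps the old behaviour and forces the fallback.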
12 years ago forcedeth: add transmit timestamping support
Willem de Bruijn [Fri, 27 Apr 2012 09:04:07 +0000 (09:04 +0000)]
forcedeth: add transmit timestamping support

Insert an skb_tx_timestamp call in both ndo_start_xmit routines.
Tested to work for the nv_start_xmit_optimized case.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago bnx2x: add transmit timestamping support
Willem de Bruijn [Fri, 27 Apr 2012 09:04:06 +0000 (09:04 +0000)]
bnx2x: add transmit timestamping support

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago e1000e: add transmit timestamping support
Willem de Bruijn [Fri, 27 Apr 2012 09:04:05 +0000 (09:04 +0000)]
e1000e: add transmit timestamping support

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago e1000: add transmit timestamping support
Willem de Bruijn [Fri, 27 Apr 2012 09:04:04 +0000 (09:04 +0000)]
e1000: add transmit timestamping support

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tipc: compress out gratuitous extra carriage returns
Paul Gortmaker [Mon, 30 Apr 2012 19:29:02 +0000 (15:29 -0400)]
tipc: compress out gratuitous extra carriage returns

Some of the comment blocks are floating in limbo between two
functions, or between blocks of code.  Delete the extra line
feeds between any comment and its associated following block
of code, to be consistent with the majority of the rest of
the kernel.  Also delete trailing newlines at EOF and fix
a couple trivial typos in existing comments.

This is a 100% cosmetic change with no runtime impact.  We get
rid of over 500 lines of non-code, and being blank line deletes,
they won't even show up as noise in git blame.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
12 years ago bridge: Fix fatal typo in setup of multicast_querier_expired
Herbert Xu [Mon, 30 Apr 2012 00:22:56 +0000 (00:22 +0000)]
bridge: Fix fatal typo in setup of multicast_querier_expired

Unfortunately it seems that I didn't properly test the case of
an expired external querier in the recent multicast bridge series.

The setup of the timer in that case is completely broken and leads
to a NULL-pointer dereference.  This patch fixes it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago l2tp: Add missing net/net/ip6_checksum.h include.
David S. Miller [Mon, 30 Apr 2012 17:21:28 +0000 (13:21 -0400)]
l2tp: Add missing net/net/ip6_checksum.h include.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net/l2tp: add support for L2TP over IPv6 UDP
Benjamin LaHaise [Fri, 27 Apr 2012 08:24:18 +0000 (08:24 +0000)]
net/l2tp: add support for L2TP over IPv6 UDP

Now that encap_rcv() works on IPv6 UDP sockets, wire L2TP up to IPv6.
Support has been tested with and without hardware offloading.  This
version fixes the L2TP over localhost issue with incorrect checksums
being reported.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net/ipv6/udp: UDP encapsulation: introduce encap_rcv hook into IPv6
Benjamin LaHaise [Fri, 27 Apr 2012 08:24:08 +0000 (08:24 +0000)]
net/ipv6/udp: UDP encapsulation: introduce encap_rcv hook into IPv6

Now that the semantics of udpv6_queue_rcv_skb() match IPv4's
udp_queue_rcv_skb(), introduce the UDP encap_rcv() hook for IPv6.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
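
The encap_rcv() hook introduced by the commit above can be pictured as an optional callback consulted before a datagram is queued on the socket. The model below is a user-space sketch, not kernel code: sockets are plain dicts and `udp_queue_rcv()` is a hypothetical name. The return-value convention (positive means "deliver as ordinary UDP", zero means "consumed", negative means "resubmit as protocol -N") is an assumption based on the IPv4 hook, which the commit message says the IPv6 code now matches.

```python
def udp_queue_rcv(sock, skb):
    """Hypothetical model of a UDP receive path with an encapsulation hook.

    Returns 0 when the packet was queued or consumed, or a positive
    protocol number when the hook asks for the packet to be resubmitted.
    """
    encap_rcv = sock.get("encap_rcv")
    if encap_rcv is not None:
        ret = encap_rcv(sock, skb)
        if ret <= 0:
            # 0: the tunnel code consumed the packet.
            # <0: resubmit the packet as protocol number -ret.
            return -ret
        # >0: not an encapsulated packet, fall through to normal UDP.
    sock["queue"].append(skb)
    return 0
```

A tunnel protocol such as L2TP would register its receive function as `encap_rcv` and return 0 for the packets it handles itself.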
12 years ago net/ipv6/udp: UDP encapsulation: move socket locking into udpv6_queue_rcv_skb()
Benjamin LaHaise [Fri, 27 Apr 2012 08:23:59 +0000 (08:23 +0000)]
net/ipv6/udp: UDP encapsulation: move socket locking into udpv6_queue_rcv_skb()

In order to make sure that when the encap_rcv() hook is introduced it is
not called with the socket lock held, move socket locking from callers into
udpv6_queue_rcv_skb(), matching what happens in IPv4.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb
Benjamin LaHaise [Fri, 27 Apr 2012 08:23:21 +0000 (08:23 +0000)]
net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb

This is the first step in reworking the IPv6 UDP code to be structured more
like the IPv4 UDP code.  This patch creates __udpv6_queue_rcv_skb() with
the equivalent semantics to __udp_queue_rcv_skb(), and wires it up to the
backlog_rcv method.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net...
David S. Miller [Sun, 29 Apr 2012 02:06:17 +0000 (22:06 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

12 years ago drivers/net/oki-semi: Donot recompute IP header checksum
RongQing.Li [Thu, 26 Apr 2012 21:01:13 +0000 (21:01 +0000)]
drivers/net/oki-semi: Donot recompute IP header checksum

If I understand correctly, NETIF_F_IP_CSUM only means the hardware
will compute the TCP/UDP checksum; the IP checksum is always computed
in software.

So the workaround for hardware that cannot compute checksums for small
packets does not need to compute the IP header checksum.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago drivers/net/oki-semi: Remove the definition of PCH_GBE_ETH_ALEN
RongQing.Li [Thu, 26 Apr 2012 21:01:12 +0000 (21:01 +0000)]
drivers/net/oki-semi: Remove the definition of PCH_GBE_ETH_ALEN

PCH_GBE_ETH_ALEN is equal to ETH_ALEN, so we can replace it with
ETH_ALEN.

If they are not equal, it must be a bug, since this is Ethernet,
and the address has already been stored to mc_addr_list as ETH_ALEN
bytes when calling pch_gbe_mac_mc_addr_list_update.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net/at91_ether: use gpio_to_irq for phy IRQ line
Nicolas Ferre [Thu, 26 Apr 2012 00:30:43 +0000 (00:30 +0000)]
net/at91_ether: use gpio_to_irq for phy IRQ line

Use the gpio_to_irq() function to retrieve the phy IRQ line
from the GPIO pin specification.
This fix is needed now that we have moved to irqdomains on AT91.

Reported-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Andrew Victor <avictor.za@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago AT91: Remove fixed mapping for AT91RM9200 ethernet
Andrew Victor [Thu, 26 Apr 2012 00:30:42 +0000 (00:30 +0000)]
AT91: Remove fixed mapping for AT91RM9200 ethernet

The AT91RM9200 Ethernet controller still has a fixed IO mapping.
So:
* Remove the fixed IO mapping and AT91_VA_BASE_EMAC definition.
* Pass the physical base-address via platform-resources to the driver.
* Convert at91_ether.c driver to perform an ioremap().
* Ethernet PHY detection needs to be performed during the driver
initialization process, it can no longer be done first.

Signed-off-by: Andrew Victor <linux@maxim.org.za>
Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago net: Fixed a coding style issue related to spaces.
Jeffrin Jose [Wed, 25 Apr 2012 13:47:29 +0000 (19:17 +0530)]
net: Fixed a coding style issue related to spaces.

Fixed a coding style issue relating to spaces
in net/core/sock.c

Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years ago tipc: Reject payload messages with invalid message type
Allan Stephens [Thu, 26 Apr 2012 22:13:08 +0000 (18:13 -0400)]
tipc: Reject payload messages with invalid message type

Adds check to ensure TIPC sockets reject incoming payload messages
that have an unrecognized message type.

Remove the old open question about whether TIPC_ERR_NO_PORT is
the proper return value.  It is appropriate here since there are
valid instances where another node can make use of the reply,
and at this point in time the host is already broadcasting TIPC
data, so there are no real security concerns.

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
12 years ago ixgbe: check for WoL support in single function
Jacob Keller [Sat, 21 Apr 2012 06:05:40 +0000 (06:05 +0000)]
ixgbe: check for WoL support in single function

This patch consolidates the case logic for checking whether a device supports
WoL into a single place. Previously ethtool and probe used similar logic that
was copied and maintained separately. This patch encapsulates the core logic
into a function so that a user only has to update one place.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago igb: Force flow control off during reset when forcing speed.
Matthew Vick [Wed, 18 Apr 2012 02:57:44 +0000 (02:57 +0000)]
igb: Force flow control off during reset when forcing speed.

During igb_reset(), we initiate a hardware reset which will clear our
flow control settings. For auto-negotiation, we re-negotiate them when
linking up again, but we need to force them off properly for the forced
speed case.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: 82579 potential system hang on stress when ME enabled
Bruce Allan [Tue, 20 Mar 2012 03:47:52 +0000 (03:47 +0000)]
e1000e: 82579 potential system hang on stress when ME enabled

Previously, a workaround was added to address a hardware bug in the
PCIm2PCI arbiter where a write by the driver of the Transmit/Receive
Descriptor Tail register could happen concurrently with a write of any
MAC CSR register by the Manageability Engine (ME) which could cause the
Tail register to have an incorrect value.  The arbiter is supposed to
prevent the concurrent writes but there is a bug that can cause the Host
(driver) access to be acknowledged later than it should.
After further investigation, it was discovered that a driver write access
of any MAC CSR register after being idle for some time can be lost when
ME is accessing a MAC CSR register.  When this happens, no further target
access is claimed by the MAC which could hang the system.
The workaround to check bit 24 in the FWSM register (set only when ME is
accessing a MAC CSR register) and delay for a limited amount of time until
it is cleared is now done for all driver writes of MAC CSR registers on
82579 with ME enabled.  In the rare case when the driver is writing the
Tail register and ME is accessing any MAC CSR register for a duration
longer than the maximum delay, write the register and verify it has the
correct value before continuing, otherwise reset the device.

This patch also moves some pre-existing macros from the hardware-specific
header file to the more appropriate generic driver header file.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
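
The workaround described in the commit above follows a common pattern: poll a busy bit for a bounded time before writing, and for the tail register, verify the write if the wait timed out. The sketch below models that flow in user-space Python. The constant and function names, the callback-style register accessors, and the poll count are all hypothetical; only "bit 24 of FWSM" and the write, verify, reset sequence come from the commit message.

```python
FWSM_ME_BUSY = 1 << 24  # bit 24 of FWSM; the constant name is hypothetical


def wait_for_me_idle(read_fwsm, max_polls, delay=lambda: None):
    """Poll FWSM until bit 24 clears; give up after max_polls reads."""
    for _ in range(max_polls):
        if not (read_fwsm() & FWSM_ME_BUSY):
            return True
        delay()
    return False


def write_tail(read_fwsm, write_reg, read_reg, value, max_polls=100):
    """Write a tail register; return "ok" or "reset".

    If ME stayed busy for the whole wait, the write may have been lost,
    so read the register back and request a device reset on mismatch.
    """
    idle = wait_for_me_idle(read_fwsm, max_polls)
    write_reg(value)
    if not idle and read_reg() != value:
        return "reset"
    return "ok"
```

Bounding the wait keeps the driver from stalling indefinitely, and the read-back covers the rare case where the bound is exceeded.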
12 years ago e1000e: 82579 packet drop workaround
Bruce Allan [Tue, 20 Mar 2012 03:47:47 +0000 (03:47 +0000)]
e1000e: 82579 packet drop workaround

In K1 mode (a MAC/PHY interconnect power mode), the 82579 device shuts down
the Phase Lock Loop (PLL) of the interconnect to save power.  When the PLL
starts working, the 82579 device may start to transfer the packet through
the interconnect before it is fully functional causing packet drops.  This
workaround disables shutting down the PLL in K1 mode for 1G link speed.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Enable DMA Burst Mode on 82574 by default.
Matthew Vick [Fri, 16 Mar 2012 09:02:59 +0000 (09:02 +0000)]
e1000e: Enable DMA Burst Mode on 82574 by default.

Performance testing has shown that enabling DMA burst on 82574
improves performance on small packets, so enable it by default.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
12 years ago e1000e: Disable Far-End LoopBack following reset on 80003ES2LAN.
Matthew Vick [Fri, 16 Mar 2012 09:02:58 +0000 (09:02 +0000)]
e1000e: Disable Far-End LoopBack following reset on 80003ES2LAN.

80003ES2LAN has an erratum such that far-end loopback may be activated by
bit errors producing a reserved symbol. In order to disable far-end
loopback quickly enough, disable it immediately following a reset.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>