git.stricted.de - GitHub/moto-9609/android_kernel_motorola

netfilter: nf_conntrack: make nf_ct_zone_dflt built-in

Fengguang reported, that some randconfig generated the following linker
issue with nf_ct_zone_dflt object involved:

  [...]
  CC      init/version.o
  LD      init/built-in.o
  net/built-in.o: In function `ipv4_conntrack_defrag':
  nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
  net/built-in.o: In function `ipv6_defrag':
  nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
  make: *** [vmlinux] Error 1

Given that configurations exist where we have a built-in part, which is
accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
area when netfilter is configured in general.

Therefore, split the more generic parts into a common header under
include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
section that already holds parts related to CONFIG_NF_CONNTRACK in the
netfilter core. This fixes the issue on my side.

Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled

While testing various Kconfig options on another issue, I found that
the following one triggers as well on allmodconfig and nf_conntrack
disabled:

  net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’:
  net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
    if (this_cpu_read(nf_skb_duplicated))
  [...]
  net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’:
  net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
    if (this_cpu_read(nf_skb_duplicated))

Fix it by including directly the header where it is defined.

Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fec: clear receive interrupts before processing a packet

The patch just to re-submit the patch "db3421c114cfa6326" because the
patch "4d494cdc92b3b9a0" remove the change.

Clear any pending receive interrupt before we process a pending packet.
This helps to avoid any spurious interrupts being raised after we have
fully cleaned the receive ring, while still allowing an interrupt to be
raised if we receive another packet.

The position of this is critical: we must do this prior to reading the
next packet status to avoid potentially dropping an interrupt when a
packet is still pending.

Acked-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: fix exthdrs offload registration in out_rt path

We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

xen-netback: add support for multicast control

Xen's PV network protocol includes messages to add/remove ethernet
multicast addresses to/from a filter list in the backend. This allows
the frontend to request the backend only forward multicast packets
which are of interest thus preventing unnecessary noise on the shared
ring.

The canonical netif header in git://xenbits.xen.org/xen.git specifies
the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal
necessary changes have been pulled into include/xen/interface/io/netif.h.

To prevent the frontend from extending the multicast filter list
arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries.
This limit is not specified by the protocol and so may change in future.
If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD
sent by the frontend will be failed with NETIF_RSP_ERROR.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bgmac: Update fixed_phy_register()

Commit a5597008dbc2 ("phy: fixed_phy: Add gpio to determine link up/down.")
added a new argument to fixed_phy_register(), but missed to update bgmac
driver, causing the following build failure:

drivers/net/ethernet/broadcom/bgmac.c:1450:2: error: too few arguments to function 'fixed_phy_register'

Add the missing argument.

Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock, diag: fix panic in sock_diag_put_filterinfo

diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
upon request to user space (ss -0 -b). However, native eBPF programs
attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:

Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
unconditionally tries to access its filter length resp. wants to copy
the filter insns from there. Internal cBPF to eBPF transformations
attached to sockets don't have this issue, as orig_prog state is kept.

It's currently only used by packet sockets. If we would want to add
native eBPF support in the future, this needs to be done through
a different attribute than PACKET_DIAG_FILTER to not confuse possible
user space disassemblers that work on diag data.

Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Use 'const' where possible.

Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Fix function argument ordering dependency

Commit c6cc1ca7f4d70c ("flowi: Abstract out functions to get flow hash
based on flowi") introduced a bug in __skb_set_sw_hash where we
require a dependency on evaluating arguments in a function in order.
There is no such ordering enforced in C, so this incorrect. This
patch fixes that by splitting out the arguments. This bug was
found via a compiler warning that keys may be uninitialized.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-09-01

This series contains updates to i40e, ixgbe and ixgbevf.

Anjali fixes a bug in i40e where the port is not receiving multicast or VLAN
tagged packets in promiscuous mode.  Which can occur when a software bridge
is created on top of the device.

Don adds support in ixgbe that indicates the presence of management firmware.
Added support for entering low power link up state on devices that support
it when the device is closing or suspending.  Updated the driver to report
unknown bus speed and width since IOSF does not report a PCIe bus speed or
width for X550 devices.  Also added the new bus type for integrated I/O
interface (IOSF).  Cleaned up of redundant code in ixgbe.

Mark adds support for UDP-encapsulation transmit checksum and for VXLAN
receive offloads.  Introduces a helper function to do the register access
and processing to avoid needless PHY access on copper PHYs.  Added support
for reporting 2.5G link speed.  Fixed warnings resulting from redundant
initializations of the get_bus_info field.

Maninder Singh updates the ixgbe driver to use kzalloc instead of kcalloc
for allocation of one thing.

Tom Barbette adds support for ethtool to change the rxfh indirection table
and/or key using ethtool interface.

Emil resolves an issue where users were not able to dynamically set number
of queues for 82598 via ethtool -L.

Alex Williamson removes bimodal SR-IOV disabling behavior since it is
confusing to users and results in a state where the PF is broken for other
uses unless the user sets sriov_numvfs to zero prior to unbinding the
device.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Resolve "initialized field overwritten" warnings

Resolve warnings resulting from redundant initialization of the
get_bus_info field in the mac_ops_X550* structures.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Remove bimodal SR-IOV disabling

When unbinding an SR-IOV device with VFs configured from ixgbe, the
driver behaves in one of two ways.  If max_vfs was specified, the
SR-IOV state is disabled, removing the VFs.  The occurs regardless of
whether the VF count was later modified through sysfs.  If however
max_vfs is zero, such as by not specifying the module parameter, the
VFs persist after the PF is unbound from ixgbe.  If the PF is then
bound to vfio-pci to be assigned to a VM, the PF is non-functional.

>From the comment, commit da36b64736cf ("ixgbe: Implement PCI SR-IOV
sysfs callback operation") clearly intended this alternate behavior,
but probably didn't realize the PF doesn't work in this mode.

This bimodal behavior is confusing to users and results in a state
where the PF is broken for other uses unless the user sets
sriov_numvfs to zero prior to unbinding the device.  Remove this
behavior so that VFs are removed and the PF is functional for other
uses after unbind, regardless of the way VFs are enabled.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support for reporting 2.5G link speed

Now that we can do 2.5G link speed, we need to be able to report it.
Also change the nested triadic involved in creating the log message
to instead use a simpler switch statement to set a string pointer.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: fix bounds checking in ixgbe_setup_tc for 82598

This patch resolves an issue where users were not able to dynamically
set number of queues for 82598 via ethtool -L

Reported-by: Tal Abudi <talabudi@gmail.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: support for ethtool set_rxfh

Allows to change the rxfh indirection table and/or key using
ethtool interface.

Signed-off-by: Tom Barbette <tom.barbette@ulg.ac.be>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Avoid needless PHY access on copper phys

Avoid a needless PHY access on copper phys to save the 10ms wait
time for each PHY access. A helper function is introduced to
actually do the register access and process the contents.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: cleanup to use cached mask value

We already cache this FW/SW semaphore mask so might as well use it
for consistency.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Remove second instance of lan_id variable

This patch removes the redundant lan_id in the phy struct and uses
the bus version. Both variables exist and intend to represent the
STATUS register LAN_ID field. However, phy.lan_id is not bit shifted
so the phy.lan_id = 0x0 for LAN Id 0 and phy.lan_id = 0x4 for LAN Id 1.
Where bus.lan_id is bit shifted so bus.lan_id = 0x0 for LAN Id 0 and
bus.lan_id = 0x1 for LAN Id 1. There seems no need for the additional
lan_id variable and this should make the code less confusing.

Signed-off-by: Donald C Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: use kzalloc for allocating one thing

Use kzalloc rather than kcalloc(1..

The semantic patch that makes this change is as follows:

// <smpl>
@@
@@

- kcalloc(1,
+ kzalloc(
...)
// </smpl>

and removing checkpatch below CHECK:
CHECK: Prefer kzalloc(sizeof(*fwd_adapter)...) over
kzalloc(sizeof(struct ixgbe_fwd_adapter)...)

Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c

These cannot live in net/core/flow.c which only builds when XFRM is
enabled.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Remove unused PCI bus types

The ixgbe never has as very doubtfully ever will support either
PCI or PCI-X devices. So remove the unused types from the
ixgbe_bus_type. Thanks to Alex Duyck for suggesting this.

Signed-off-by: Donald C Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: add new bus type for intergrated I/O interface (IOSF)

With this patch we add support for a new bus type ixgbe_bus_type_internal.
X550em devices use IOSF and not PCIe bus so this new type is to accommodate
them.

Signed-off-by: Donald C Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: add get_bus_info method for X550

Added ixgbe_get_bus_info_X550em to X550 code. ixgbe_get_bus_info_X550em
sets bus.width to ixgbe_bus_width_unknown and bus.speed to
ixgbe_bus_speed_unknown, because IOSF does not report a PCIe bus
width or speed.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support for entering low power link up state

When the device is closing or suspending, call ixgbe_enter_lplu to
enter low power link up state on devices that support it. When this
is done, prevent the phy from being reset in the ixgbe_down path
so that link is present when calling ixgbe_enter_lplu.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support for VXLAN RX offloads

Add support for VXLAN RX offloads for the X55x devices that support
them.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support for UDP-encapsulated tx checksum offload

By using GSO for UDP-encapsulated packets, all ixgbe devices can
be directed to generate checksums for the inner headers because
the outer UDP checksum can be zero. So point the machinery at the
inner headers and have the hardware generate the checksum.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

flow_dissector: Don't use bit fields.

Just have a flags member instead.

   In file included from include/linux/linkage.h:4:0,
                    from include/linux/kernel.h:6,
                    from net/core/flow_dissector.c:1:
   In function 'flow_keys_hash_start',
       inlined from 'flow_hash_from_keys' at net/core/flow_dissector.c:553:34:
>> include/linux/compiler.h:447:38: error: call to '__compiletime_assert_459' declared with attribute error: BUILD_BUG_ON failed: FLOW_KEYS_HASH_OFFSET % sizeof(u32)

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Check whether FDIRCMD writes actually complete

Wait up to about 100 us for FDIRCMD writes to complete and return
failure indications.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Assign set_phy_power dynamically where needed

There are various reasons why this method may or may not need to be
defined and some of these we don't know until runtime. So we will
set the value in get_invariants.

Signed-off-by: Donald C Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: add new function to check for management presence

This patch adds a support function that will indicate for the
existence of management FW.

Signed-off-by: Donald C Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Set defport behavior for the Main VSI when in promiscuous mode

This fixes bugs where the port is not receiving multicast or VLAN tagged
packets when in promiscuous mode. This can occur when a SW bridge is
created on top of the device.

This also fixes issues where the promiscuous behavior setting was not
being preserved across a reset caused by features being enabled or
disabled.

We are using defport instead of doing a true promiscuous mode because we do
not need to receive the SRIOV or VMDq VSI directed traffic which would suck
up bandwidth and is really not intended for the SW bridge.

In addition, with defport we get VLAN promiscuous behavior which is not
possible from the VSI level promiscuous setting.

Change-ID: Ie21985eac32d5af1c02e9d71c6430a90d5bab40f
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'flow-dissector-features'

Tom Herbert says:

====================
flow_dissector: Paramterize dissection and other features

This patch set adds some new capabilities to flow_dissector:

- Add flags to flow dissector functions to control dissection
  - Flag to stop dissection when L3 header is seen (don't
    dissect L4)
  - Flag to stop dissection when encapsulation is detected
  - Flag to parse first fragment of fragmented packet. This
    may provide L4 ports
- Added new reporting in key_control
  - Packet is a fragment
  - Packet is a first fragment
  - Packet has encapsulation

Also:
  - Make __skb_set_sw_hash a general function
  - Create functions to get a flow hash based on flowi4 or flowi6
    structures without an reference to an skbuff
  - Ignore flow dissector return value from ___skb_get_hash. Just
    use whatever key fields are found to make a hash

Tested:

Ran 200 netperf TCP_RR instances for IPv6 and IPv4. Did not see any
regression. Ran UDP_RR with 10000 byte request and response size
for IPv4 and IPv6, no regression observed however I did see better
performance with IPv6 flow labels due to use of flow labels for L4
hash.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Ignore flow dissector return value from ___skb_get_hash

In ___skb_get_hash ignore return value from skb_flow_dissect_flow_keys.
A failure in that function likely means that there was a parse error,
so we may as well use whatever fields were found before the error was
hit. This is also good because it means we won't keep trying to derive
the hash on subsequent calls to skb_get_hash for the same packet.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Add control/reporting of encapsulation

Add an input flag to flow dissector on rather dissection should stop
when encapsulation is detected (IP/IP or GRE). Also, add a key_control
flag that indicates encapsulation was encountered during the
dissection.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Add flag to stop parsing when an IPv6 flow label is seen

Add an input flag to flow dissector on rather dissection should be
stopped when a flow label is encountered. Presumably, the flow label
is derived from a sufficient hash of an inner transport packet so
further dissection is not needed (that is ports are not included in
the flow hash). Using the flow label instead of ports has the additional
benefit that packet fragments should hash to same value as non-fragments
for a flow (assuming that the same flow label is used).

We set this flag by default in for skb_get_hash.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Add flag to stop parsing at L3

Add an input flag to flow dissector on rather dissection should be
stopped when an L3 packet is encountered. This would be useful if a
caller just wanted to get IP addresses of the outermost header (e.g.
to do an L3 hash).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Support IPv6 fragment header

Parse NEXTHDR_FRAGMENT. When seen account for it in the fragment bits of
key_control. Also, check if first fragment should be parsed.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Add control/reporting of fragmentation

Add an input flag to flow dissector on rather dissection should be
attempted on a first fragment. Also add key_control flags to indicate
that a packet is a fragment or first fragment.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Add flags argument to skb_flow_dissector functions

The flags argument will allow control of the dissection process (for
instance whether to parse beyond L3).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Jump to exit code in __skb_flow_dissect

Instead of returning immediately (on a parsing failure for instance) we
jump to cleanup code. This always sets protocol values in key_control
(even on a failure there is still valid information in the key_tags that
was set before the problem was hit).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flowi: Abstract out functions to get flow hash based on flowi

Create __get_hash_from_flowi6 and __get_hash_from_flowi4 to get the
flow keys and hash based on flowi structures. These are called by
__skb_get_hash_flowi6 and __skb_get_hash_flowi4. Also, created
get_hash_from_flowi6 and get_hash_from_flowi4 which can be called
when just the hash value for a flowi is needed.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

skbuff: Make __skb_set_sw_hash a general function

Move __skb_set_sw_hash to skbuff.h and add __skb_set_hash which is
a common method (between __skb_set_sw_hash and skb_set_hash) to set
the hash in an skbuff.

Also, move skb_clear_hash to be closer to __skb_set_hash.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: Move skb related functions to skbuff.h

Move the flow dissector functions that are specific to skbuffs into
skbuff.h out of flow_dissector.h. This makes flow_dissector.h have
no dependencies on skbuff.h.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tg3: Fix temperature reporting

The temperature registers appear to report values in degrees Celsius
while the hwmon API mandates values to be exposed in millidegrees
Celsius. Do the conversion so that the values reported by "sensors"
are correct.

Fixes: aed93e0bf493 ("tg3: Add hwmon support for temperature")
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: stable@vger.kernel.org [v3.6+]
Signed-off-by: David S. Miller <davem@davemloft.net>

phylib: fix device deletion order in mdiobus_unregister()

commit 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not
the bus' parent.") uncovered a problem in mdiobus_unregister() which
leads to this warning when I reboot an APM Mustang (arm64) platform:

  WARNING: CPU: 7 PID: 4239 at fs/sysfs/group.c:224 sysfs_remove_group+0xa0/0xa4()
  sysfs group fffffe0000e07a10 not found for kobject 'xgene-mii-eth0:03'
  ...
  CPU: 7 PID: 4239 Comm: reboot Tainted: G            E   4.2.0-0.18.el7.test15.aarch64 #1
  Hardware name: AppliedMicro Mustang/Mustang, BIOS 1.1.0 Aug 26 2015
  Call Trace:
  [<fffffe000009739c>] dump_backtrace+0x0/0x170
  [<fffffe000009752c>] show_stack+0x20/0x2c
  [<fffffe00007436f0>] dump_stack+0x78/0x9c
  [<fffffe00000c2cb4>] warn_slowpath_common+0xa0/0xd8
  [<fffffe00000c2d60>] warn_slowpath_fmt+0x74/0x88
  [<fffffe0000293d3c>] sysfs_remove_group+0x9c/0xa4
  [<fffffe00004a8bac>] dpm_sysfs_remove+0x5c/0x70
  [<fffffe000049b388>] device_del+0x44/0x208
  [<fffffe000049b578>] device_unregister+0x2c/0x7c
  [<fffffe000050dc68>] mdiobus_unregister+0x48/0x94
  [<fffffe000052afd0>] xgene_enet_mdio_remove+0x28/0x44
  [<fffffe000052d3f0>] xgene_enet_remove+0xd0/0xd8
  [<fffffe000052d424>] xgene_enet_shutdown+0x2c/0x3c
  [<fffffe00004a204c>] platform_drv_shutdown+0x24/0x40
  [<fffffe000049d4f4>] device_shutdown+0xf0/0x1b4
  [<fffffe00000e31ec>] kernel_restart_prepare+0x40/0x4c
  [<fffffe00000e32f8>] kernel_restart+0x1c/0x80
  [<fffffe00000e3670>] SyS_reboot+0x17c/0x250

The problem is that mdiobus_unregister() deletes the bus device before
unregistering the phy devices on the bus. This wasn't a problem before
because the phys were not children of the bus:

  /sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0:03
  /sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0

But now that they are:

  /sys/devices/platform/APMC0D05:00/net/eth0/xgene-mii-eth0/xgene-mii-eth0:03

when mdiobus_unregister deletes the bus device, the phy subdirs are
removed from sysfs also. So when the phys are unregistered afterward,
we get the warning. This patch changes the order so that phys are
unregistered before the bus device is deleted.

Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.")
Signed-off-by: Mark Salter <msalter@redhat.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Mark Langsdorf <mlangsdo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make table id type u32

A number of VRF patches used 'int' for table id. It should be u32 to be
consistent with the rest of the stack.

Fixes:
4e3c89920cd3a ("net: Introduce VRF related flags and helpers")
15be405eb2ea9 ("net: Add inet_addr lookup by table")
30bbaa1950055 ("net: Fix up inet_addr_type checks")
021dd3b8a142d ("net: Add routes to the table associated with the device")
dc028da54ed35 ("inet: Move VRF table lookup to inlined function")
f6d3c19274c74 ("net: FIB tracepoints")

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tun_dst: Remove opts_size

opts_size is only written and never read. Following patch
removes this unused variable.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: send only one NEWLINK when RA causes changes

Signed-off-by: Marius Tomaschewski <mt@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

gro_cells: remove spinlock protecting receive queues

As David pointed out, spinlock are no longer needed
to protect the per cpu queues used in gro cells infrastructure.

Also use new napi_complete_done() API so that gro_flush_timeout
tweaks have an effect.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qmi_wwan: Sierra Wireless MC73xx -> Sierra Wireless MC7304/MC7354

Other Sierra Wireless MC73xx devices exist, with different USB IDs.

Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: send NEWLINK on RA managed/otherconf changes

The kernel is applying the RA managed/otherconf flags silently and
forgets to send ifinfo notify to inform about their change when the
router provides a zero reachable_time and retrans_timer as dnsmasq
and many routers send it, which just means unspecified by this router
and the host should continue using whatever value it is already using.
Userspace may monitor the ifinfo notifications to activate dhcpv6.

Signed-off-by: Marius Tomaschewski <mt@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-port-config'

Andrew Lunn says:

====================
DSA port configuration and status

This patchset allows various switch port settings to be configured and
port status to be sampled. Some of these patches have been posted
before.

The first three patches provide infrastructure for configuring a
switch ports link speed and duplex from a fixed_link phy.

Patch four then uses this infrastructure to allow the CPU and DSA
ports of a switch to be configured using a fixed-link property in the
device tree.

Patches five and six allow a phy-mode property to be specified in the
device tree, and allow this to be used for configuring RGMII delays.

Patches seven through nine allow link status, for example that of an
SFP module, to be read from a gpio.

Changes since v1:

Rewrite 9/9 so that it hopefully does not regression on
868a4215be9a6d80 ("net: phy: fixed_phy: handle link-down case")
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: fixed_phy: Set phy capabilities even when link down.

What features a phy supports is masked in genphy_config_init() by
looking at the PHYs BMSR register.

If the link is down, fixed_phy_update_regs() will only set the auto-
negotiation capable bit in BMSR. Thus genphy_config_init() comes to
the conclusion the PHY can only perform 10/Half, and masks out the
higher speed features. If however the link it up, BMSR is set to
indicate the speed the PHY is capable of auto-negotiating, and
genphy_config_init() does not mask out the high speed features.

To fix this, when the link is down, have fixed_phy_update_regs() leave
the link status, auto-negotiation complete, and link partner
capabilities unset, but set all the local capabilities depending on
the fixed phy speed.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

phy: fixed_phy: Add gpio to determine link up/down.

An SFP module may have a link up/down status pin which can be
connection to a GPIO line of the host. Add support for reading such an
GPIO in the fixed_phy driver.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dsa: mv88e6xxx: Don't poll forced interfaces for state changes

When polling for link status, don't consider ports which have a forced
link. Such ports don't monitor their phy or may not even have a phy.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dsa: mv88e6xxx: Set the RGMII delay based on phy interface

Some Marvell switches allow the RGMII Rx and Tx clock to be delayed
when the port is using RGMII. Have the adjust_link function look at
the phy interface type and enable this delay as requested.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Allow DSA and CPU ports to have a phy-mode property

It can be useful for DSA and CPU ports to have a phy-mode property, in
particular to specify RGMII delays. Parse the property and set it in
the fixed-link phydev.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Allow configuration of CPU & DSA port speeds/duplex

By default, DSA and CPU ports are configured to the maximum speed the
switch supports. However there can be use cases where the peer devices
port is slower. Allow a fixed-link property to be used with the DSA
and CPU port in the device tree, and use this information to configure
the port.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phy: fixed_phy: Set supported speed in phydev

Set the supported field of the phydev to indicate the speed features
of the phy. If the phy is never attached to a netdev, but used in an
adjust_link() function, the speed will be incorrectly evaluated to
10/half rather than the correct speed/duplex.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dsa: mv88e6xxx: Allow speed/duplex of port to be configured

The current code sets user ports to perform auto negotiation using the
phy. CPU and DSA ports are configured to full duplex and maximum speed
the switch supports.

There are however use cases where the CPU has a slower port, and when
user ports have SFP modules with fixed speed. In these cases, port
settings to be read from a fixed_phy devices. The switch driver then
needs to implement the adjust_link op, so the port settings can be
set.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Allow PHY devices to identify themselves as Ethernet switches, etc.

Some Ethernet MAC drivers using the PHY library require the hardcoding
of link parameters when interfaced to a switch device, SFP module,
switch to switch port, etc. This has typically lead to various ad-hoc
implementations looking like this:

- using a "fixed PHY" emulated device, which will provide link
indication towards the Ethernet MAC driver and hardware

- pretend there is no PHY and hardcode link parameters, ala mv643x_eth

Based on that, it is desireable to have the PHY drivers advertise the
correct link parameters, just like regular Ethernet PHYs towards their
CPU Ethernet MAC drivers, however, Ethernet MAC drivers should be able
to tell whether this link should be monitored or not. In the context
of an Ethernet switch, SFP module, switch to switch link, we do not
need to monitor this link since it should be always up.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: fix mpls_net_init memory leak

Fix a memory leak in the mpls netns init function in case of failure. If
register_net_sysctl fails then we need to free the ctl_table.

Fixes: 7720c01f3f59 ("mpls: Add a sysctl to control the size of the mpls label table")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add tos to validate source tracepoint

TOS is another key aspect of the lookup passed to fib_validate_source.
Add it to the tracepoint.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

lib: move strncpy_from_unsafe() into mm/maccess.c

To fix build errors:
kernel/built-in.o: In function `bpf_trace_printk':
bpf_trace.c:(.text+0x11a254): undefined reference to `strncpy_from_unsafe'
kernel/built-in.o: In function `fetch_memory_string':
trace_kprobe.c:(.text+0x11acf8): undefined reference to `strncpy_from_unsafe'

move strncpy_from_unsafe() next to probe_kernel_read/write()
which use the same memory access style.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 1a6877b9c0c2 ("lib: introduce strncpy_from_unsafe()")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'per-route-dctcp-receive-side'

Daniel Borkmann says:

====================
tcp: receive-side per route dctcp handling

Original cover letter:

  Currently, the following case doesn't use DCTCP, even if it should:

    - responder has f.e. cubic as system wide default
    - 'ip route congctl dctcp $src' was set

  Then, DCTCP is NOT used if a DCTCP sender attempts to connect from a
  host in the $src range: ECT(0) is set, but listen_sk is not dctcp, so
  we fail the INET_ECN_is_not_ect sanity check.

  We also have to examine the dst used for the SYN/ACK reply to make
  this case work.

  In order to minimize additional cost, store the 'ecn is must have'
  information is the dst_features field.

  The set targets -next instead of -net since this doesn't seem to be a
  serious bug and to give the change more soak time until it hits linus
  tree.

v1 -> v2:
- Addressed Dave's feedback, not exposing any bits to user space
- Added patch 3 to reject incorrect configurations
- Rest as is, rebased and retested
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: use dctcp if enabled on the route to the initiator

Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.

We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().

Joint work with Florian Westphal.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib, fib6: reject invalid feature bits

Feature bits that are invalid should not be accepted by the kernel,
only the lower 4 bits may be configured, but not the remaining ones.
Even from these 4, 2 of them are unused.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fib6: reduce identation in ip6_convert_metrics

Reduce the identation a bit, there's no need to artificically have
it increased.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fib: move metrics parsing to a helper

fib_create_info() is already quite large, so before adding more
code to the metrics section move that to a helper, similar to
ip6_convert_metrics.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

IGMP: Document igmp_link_local_mcast_reports

Document the addition of a new sysctl variable which controls the
generation of IGMP reports for link local multicast groups in the
224.0.0.X range.

IGMP reports for local multicast groups can now be optionally
inhibited by setting the value to zero e.g.:
echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports

To retain backwards compatibility the previous behaviour is retained
by default on system boot or reverted by setting the value back to
non-zero.

Signed-off-by: Philip Downey <pdowney@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip-tunnel: Use API to access tunnel metadata options.

Currently tun-info options pointer is used in few cases to
pass options around. But tunnel options can be accessed using
ip_tunnel_info_opts() API without using the pointer. Following
patch removes the redundant pointer and consistently make use
of API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fix 32b build

Address remaining issue after 80ec192.

Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Fix 32-bit build.

   net/ipv4/af_inet.c: In function 'snmp_get_cpu_field64':
>> net/ipv4/af_inet.c:1486:26: error: 'offt' undeclared (first use in this function)
      v = *(((u64 *)bhptr) + offt);
                             ^
   net/ipv4/af_inet.c:1486:26: note: each undeclared identifier is reported only once for each function it appears in
   net/ipv4/af_inet.c: In function 'snmp_fold_field64':
>> net/ipv4/af_inet.c:1499:39: error: 'offct' undeclared (first use in this function)
      res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset);
                                          ^
>> net/ipv4/af_inet.c:1499:10: error: too many arguments to function 'snmp_get_cpu_field'
      res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset);
             ^
   net/ipv4/af_inet.c:1455:5: note: declared here
    u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
        ^

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: rx mmap: fix POLLIN condition

Poll() returns immediately after setting the kernel current frame
(ring->head) to SKIP from user space even though there is no new
frame. And in a case of all frames is VALID, user space program
unintensionally sets (only) kernel current frame to UNUSED, then
calls poll(), it will not return immediately even though there are
VALID frames.

To avoid situations like above, I think we need to scan all frames
to find VALID frames at poll() like netlink_alloc_skb(),
netlink_forward_ring() finding an UNUSED frame at skb allocation.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'thunderx-features-fixes'

Aleksey Makarov says:

====================
net: thunderx: New features and fixes

v2:
  - The unused affinity_mask field of the structure cmp_queue
  has been deleted. (thanks to David Miller)
  - The unneeded initializers have been dropped. (thanks to Alexey Klimov)
  - The commit message "net: thunderx: Rework interrupt handling"
  has been fixed. (thanks to Alexey Klimov)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Support for internal loopback mode

Support for setting VF's corresponding BGX LMAC in internal
loopback mode. This mode can be used for verifying basic HW
functionality such as packet I/O, RX checksum validation,
CQ/RBDR interrupts, stats e.t.c. Useful when DUT has no external
network connectivity.

'loopback' mode can be enabled or disabled via ethtool.

Note: This feature is not supported when no of VFs enabled are
morethan no of physical interfaces i.e active BGX LMACs

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Support for upto 96 queues for a VF

This patch adds support for handling multiple qsets assigned to a
single VF. There by increasing no of queues from earlier 8 to max
no of CPUs in the system i.e 48 queues on a single node and 96 on
dual node system. User doesn't have option to assign which Qsets/VFs
to be merged. Upon request from VF, PF assigns next free Qsets as
secondary qsets. To maintain current behavior no of queues is kept
to 8 by default which can be increased via ethtool.

If user wants to unbind NICVF driver from a secondary Qset then it
should be done after tearing down primary VF's interface.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Rework interrupt handling

Rework interrupt handler to avoid checking IRQ affinity of
CQ interrupts. Now separate handlers are registered for each IRQ
including RBDR. Register interrupt handlers for only those
which are being used. Add nicvf_dump_intr_status() and use it
in irq handlers.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Support for HW VLAN stripping

This patch configures HW to strip 802.1Q header if found in a
receiving packet. The stripped VLAN ID and TCI information is
passed on to software via CQE_RX. Also sets netdev's 'vlan_features'
so that other HW offload features can be used for tagged packets.

This offload feature can be enabled or disabled via ethtool.

Network stack normally ignores RPS for 802.1Q packets and hence low
throughput. With this offload enabled throughput for tagged packets
will be almost same as normal packets.

Note: This patch doesn't enable HW VLAN insertion for transmit packets.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Receive hashing HW offload support

Adding support for receive hashing HW offload by using RSS_ALG
and RSS_TAG fields of CQE_RX descriptor. Also removed dependency
on minimum receive queue count to configure RSS so that hash is
always generated.

This hash is used by RPS logic to distribute flows across multiple
CPUs. Offload can be disabled via ethtool.

Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: mailboxes: remove code duplication

Use the nicvf_send_msg_to_pf() function in the mailbox code.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Add receive error stats reporting via ethtool

Added ethtool support to dump receive packet error statistics reported
in CQE. Also made some small fixes

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: fix MAINTAINERS

The liquidio and thunder drivers have different maintainers.

Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'snmp-stat-aggregation'

Raghavendra K T says:

====================
Optimize the snmp stat aggregation for large cpus

While creating 1000 containers, perf is showing lot of time spent in
snmp_fold_field on a large cpu system.

The current patch tries to improve by reordering the statistics gathering.

Please note that similar overhead was also reported while creating
veth pairs  https://lkml.org/lkml/2013/3/19/556

Changes in V4:
- remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse
   warning (Eric) also remove 'item' parameter (Joe)
- add missing memset of padding.

Changes in V3:
- use memset to initialize temp buffer in leaf function. (David)
- use memcpy to copy the buffer data to stat instead of unalign_pu (Joe)
- Move buffer definition to leaf function __snmp6_fill_stats64() (Eric)
-
Changes in V2:
- Allocate the stat calculation buffer in stack. (Eric)

Setup:
160 cpu (20 core) baremetal powerpc system with 1TB memory

1000 docker containers was created with command
docker run -itd  ubuntu:15.04  /bin/bash in loop

observation:
Docker container creation linearly increased from around 1.6 sec to 7.5 sec
(at 1000 containers) perf data showed, creating veth interfaces resulting in
the below code path was taking more time.

rtnl_fill_ifinfo
  -> inet6_fill_link_af
    -> inet6_fill_ifla6_attrs
      -> snmp_fold_field

proposed idea:
currently __snmp6_fill_stats64 calls snmp_fold_field that walks
through per cpu data to of an item (iteratively for around 36 items).
The patch tries to aggregate the statistics by going through
all the items of each cpu sequentially which is reducing cache
misses.

Performance of docker creation improved by around more than 2x
after the patch.

before the patch:
================
3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617
real 0m6.836s
user 0m0.095s
sys 0m0.011s

perf record -a docker run -itd  ubuntu:15.04  /bin/bash
=======================================================
    50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
     9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
     3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
     2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock
     1.37%  docker           docker                  [.] backtrace_qsort
     1.31%  docker           docker                  [.] strings.FieldsFunc

  cache-misses:  2.7%

after the patch:
=============
9178273e9df399c8290b6c196e4aef9273be2876225f63b14a60cf97eacfafb5
real 0m3.249s
user 0m0.088s
sys 0m0.020s

perf record -a docker run -itd  ubuntu:15.04  /bin/bash
=======================================================
    10.57%  docker           docker                [.] scanblock
     8.37%  swapper          [kernel.kallsyms]     [k] snooze_loop
     6.91%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
     6.67%  docker           [kernel.kallsyms]     [k] veth_stats_one
     3.96%  docker           docker                [.] runtime_MSpan_Sweep
     2.47%  docker           docker                [.] strings.FieldsFunc

cache-misses: 1.41 %

Please let me know if you have suggestions/comments.
Thanks Eric, Joe and David for the comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Optimize snmp stat aggregation by walking all the percpu data at once

Docker container creation linearly increased from around 1.6 sec to 7.5 sec
(at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field.

reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
through per cpu data of an item (iteratively for around 36 items).

idea: This patch tries to aggregate the statistics by going through
all the items of each cpu sequentially which is reducing cache
misses.

Docker creation got faster by more than 2x after the patch.

Result:
                       Before           After
Docker creation time   6.836s           3.25s
cache miss             2.7%             1.41%

perf before:
    50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
     9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
     3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
     2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock

perf after:
    10.57%  docker           docker                [.] scanblock
     8.37%  swapper          [kernel.kallsyms]     [k] snooze_loop
     6.91%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
     6.67%  docker           [kernel.kallsyms]     [k] veth_stats_one

changes/ideas suggested:
Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
place of unaligned_put (Joe).

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Introduce helper functions to get the per cpu data

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/net

Merge branch 'ovs-vport-cleanup'

Pravin B Shelar says:

====================
openvswitch: Cleanup post vport conversion.

After converting all vport to netdev implmentations there
is no need for some of vport functionality.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Remove vport-net

This structure is not used anymore.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Remove vport stats.

Since all vport types are now backed by netdev, we can directly
use netdev stats. Following patch removes redundant stat
from vport.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Remove egress_tun_info.

tun info is passed using skb-dst pointer. Now we have
converted all vports to netdev based implementation so
Now we can remove redundant pointer to tun-info from OVS_CB.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Remove vport get_name()

Remove unused get_name() function pointer from vport ops.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

geneve: Use GRO cells infrastructure.

Geneve can benefit from GRO at the device level in a manner similar
to other tunnels, especially as hardware offloads are still emerging.

After this patch, aggregated frames are seen on the tunnel interface.
Single stream throughput nearly doubles in ideal circumstances (on
old hardware).

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: retain parsed IPv6 header fields in flow on error skipping extension headers

When an error occurs skipping IPv6 extension headers retain the already
parsed IP protocol and IPv6 addresses in the flow. Also assume that the
packet is not a fragment in the absence of information to the contrary;
that is always use the frag_off value set by ipv6_skip_exthdr().

This allows matching on the IP protocol and IPv6 addresses of packets
with malformed extension headers.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2015-08-28

One more bunch of Bluetooth patches for 4.3:

- Crash fix for hci_bcm driver
- Enhancements to hci_intel driver (e.g. baudrate configuration)
- Fix for SCO link type after multiple connect attempts
- Cleanups & minor fixes in a few other places

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/smsc911x: Fix deferred probe for interrupt

The interrupt handler may not be available when smsc911x probes if the
interrupt handler is a GPIO controller for example. Let's fix that
by adding handling for -EPROBE_DEFER.

Cc: Steve Glendinning <steve.glendinning@shawell.net>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tnl-ipv4-ipv6'

Jiri Benc says:

====================
tunnels: fix incorrect IPv4/v6 headers interpretation

With tunneling, it is currently possible to get an IPv6 header and interpret
it as an IPv4 header, or to interpret an IPv6 address as an IPv4 address
(and vice versa). This leads to things like sending packets to incorrect
address, IPv6 flow label being interpreted as IP packet length, etc.

Fix several places where this can happen.

Most of this is net-next only. The third patch affects net, too, but it
doesn't seem there's anything in user space that sets the attribute at all
currently, thus net-next is fine.

Changelog:
v2: fixed geneve after incorrect rebase on top of Pravin's patches
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: do not receive IPv4 packets on IPv6 socket

By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.

In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.

Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seems to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fou: reject IPv6 config

fou does not really support IPv6 encapsulation. After an UDP socket is
created in fou_create, the encap_rcv callback is set either to fou_udp_recv
or to gue_udp_recv. Both of those unconditionally assume that the received
packet has an IPv4 header and access the data at network_header as it was an
IPv4 header. This leads to IPv6 flow label being interpreted as IP packet
length, etc.

Disallow fou tunnel to be configured as IPv6 until real IPv6 support is
added to fou.

CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip_tunnels: record IP version in tunnel info

There's currently nothing preventing directing packets with IPv6
encapsulation data to IPv4 tunnels (and vice versa). If this happens,
IPv6 addresses are incorrectly interpreted as IPv4 ones.

Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
in ip_tunnel_info. Reject packets at appropriate places if they are supposed
to be encapsulated into an incompatible protocol.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>