Zhu Yanjun [Thu, 10 Aug 2017 08:13:12 +0000 (04:13 -0400)]
forcedeth: replace init_timer_deferrable with setup_deferrable_timer
Replace init_timer_deferrable with setup_deferrable_timer to simplify
the source code.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxim Uvarov [Thu, 10 Aug 2017 07:47:47 +0000 (10:47 +0300)]
drivers: net: davinci_mdio: print bus frequency
Frequency can be adjusted in DT it make sense to
print current used value on driver init.
Signed-off-by: Max Uvarov <muvarov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxim Uvarov [Thu, 10 Aug 2017 07:47:46 +0000 (10:47 +0300)]
drivers: net: davinci_mdio: remove busy loop on wait user access
Polling 14 mdio devices on single mdio bus eats 30% of 1Ghz cpu time
due to busy loop in wait(). Add small delay to relax cpu.
Signed-off-by: Max Uvarov <muvarov@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 11 Aug 2017 21:00:14 +0000 (14:00 -0700)]
Merge branch 'netvsc-minor-fixes-and-improvements'
Stephen Hemminger says:
====================
netvsc: minor fixes and improvements
These are non-critical bug fixes, related to functionality now in net-next.
1. delaying the automatic bring up of VF device to allow udev to change name.
2. performance improvement
3. handle MAC address change with VF; mostly propogate the error that VF gives.
4. minor cleanups
5. allow setting send/receive buffer size with ethtool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:12 +0000 (17:46 -0700)]
netvsc: keep track of some non-fatal overload conditions
Add ethtool statistics for case where send chimmeny buffer is
exhausted and driver has to fall back to doing scatter/gather
send. Also, add statistic for case where ring buffer is full and
receive completions are delayed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:11 +0000 (17:46 -0700)]
netvsc: allow controlling send/recv buffer size
Control the size of the buffer areas via ethtool ring settings.
They aren't really traditional hardware rings, but host API breaks
receive and send buffer into chunks. The final size of the chunks are
controlled by the host.
The default value of send and receive buffer area for host DMA
is much larger than it needs to be. Experimentation shows that
4M receive and 1M send is sufficient.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:10 +0000 (17:46 -0700)]
netvsc: remove unnecessary check for NULL hdr
The function init_page_array is always called with a valid pointer
to RNDIS header. No check for NULL is needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:09 +0000 (17:46 -0700)]
netvsc: remove unnecessary cast of void pointer
Assignment to a typed pointer is sufficient in C.
No cast is needed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:08 +0000 (17:46 -0700)]
netvsc: whitespace cleanup
Fix some minor indentation issues.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:07 +0000 (17:46 -0700)]
netvsc: no need to allocate send/receive on numa node
The send and receive buffers are both per-device (not per-channel).
The associated NUMA node is a property of the CPU which is per-channel
therefore it makes no sense to force the receive/send buffer to be
allocated on a particular node (since it is a shared resource).
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:06 +0000 (17:46 -0700)]
netvsc: check error return when restoring channels and mtu
If setting new values fails, and the attempt to restore original
settings fails. Then log an error and leave device down.
This should never happen, but if it does don't go down in flames.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:05 +0000 (17:46 -0700)]
netvsc: propagate MAC address change to VF slave
If VF is slaved to synthetic device, then any change to netvsc
MAC address should be propagated to the slave device.
If slave device doesn't support MAC address change then it
should also be an error to attempt to change synthetic NIC MAC
address.
It also fixes the error unwind in the original code.
If give a bad address, the old code would change the device
MAC address anyway.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:04 +0000 (17:46 -0700)]
netvsc: don't signal host twice if empty
When hv_pkt_iter_next() returns NULL, it has already called
hv_pkt_iter_close(). Calling it twice can lead to extra host signal.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Thu, 10 Aug 2017 00:46:03 +0000 (17:46 -0700)]
netvsc: delay setup of VF device
When VF device is discovered, delay bring it automatically up in
order to allow userspace to some simple changes (like renaming).
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Wed, 9 Aug 2017 21:35:50 +0000 (00:35 +0300)]
phylink: Fix an uninitialized variable bug
"ret" isn't necessarily initialized here.
Fixes:
9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Wed, 9 Aug 2017 20:28:04 +0000 (13:28 -0700)]
liquidio: removed check for queue size alignment
There is no restriction on queue size alignment. Hence removing check for
valid queue size.
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Wed, 9 Aug 2017 19:07:08 +0000 (12:07 -0700)]
liquidio: rx/tx queue cleanup
When deleting a queue, clear its corresponding bit in the qmask, vfree its
memory, clear out the pointer that's pointing to it, and decrement the
queue count.
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <fmanlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 11 Aug 2017 20:47:01 +0000 (13:47 -0700)]
Merge branch 'net-sched-let-the-offloader-decide-what-to-offload'
Jiri Pirko says:
====================
net: sched: let the offloader decide what to offload
Currently there is a Qdisc_class_ops->tcf_cl_offload callback
that is called to find out if cls would offload rule or not.
This is only supported by sch_ingress and sch_clsact.
So the Qdisc are to decide. However, the driver knows what is he
able to offload, so move the decision making to drivers completely.
Just pass classid there and provide set of helpers to allow
identification of qdisc.
As a side effect, this actually allows clsact egress rules
offload in mlxsw.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 9 Aug 2017 12:30:35 +0000 (14:30 +0200)]
net: sched: remove cops->tcf_cl_offload
cops->tcf_cl_offload is no longer needed, as the drivers check what they
can and cannot offload using the classid identify helpers. So remove this.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 9 Aug 2017 12:30:34 +0000 (14:30 +0200)]
net: sched: remove handle propagation down to the drivers
There is no longer need to use handle in drivers, so remove it from
tc_cls_common_offload struct.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 9 Aug 2017 12:30:33 +0000 (14:30 +0200)]
net: sched: use newly added classid identity helpers
Instead of checking handle, which does not have the inner class
information and drivers wrongly assume clsact->egress as ingress, use
the newly introduced classid identification helpers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 9 Aug 2017 12:30:32 +0000 (14:30 +0200)]
net: sched: propagate classid down to offload drivers
Drivers need classid to decide they support this specific qdisc+class
or not. So propagate it down via the tc_cls_common_offload struct.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 9 Aug 2017 12:30:31 +0000 (14:30 +0200)]
net: sched: Add helpers to identify classids
Offloading drivers need to understand what qdisc class a filter is added
to. Currently they only need to identify ingress, clsact->ingress and
clsact->egress. So provide these helpers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Girish Moodalbail [Wed, 9 Aug 2017 08:09:28 +0000 (01:09 -0700)]
geneve: use netlink_ext_ack for error reporting in rtnl operations
Add extack error messages for failure paths while creating/modifying
geneve devices. Once extack support is added to iproute2, more
meaningful and helpful error messages will be displayed making it easy
for users to discern what went wrong.
Before:
=======
$ ip link add gen1 address 0:1:2:3:4:5:6 type geneve id 200 \
remote 192.168.13.2
RTNETLINK answers: Invalid argument
After:
======
$ ip link add gen1 address 0:1:2:3:4:5:6 type geneve id 200 \
remote 192.168.13.2
Error: Provided link layer address is not Ethernet
Also, netdev_dbg() calls used to log errors associated with Netlink
request have been removed.
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 11 Aug 2017 17:02:45 +0000 (10:02 -0700)]
Merge branch 'sctp-remove-typedefs-from-structures-part-6'
Xin Long says:
====================
sctp: remove typedefs from structures part 6
As we know, typedef is suggested not to use in kernel, even checkpatch.pl
also gives warnings about it. Now sctp is using it for many structures.
All this kind of typedef's using should be removed. This patchset is the
part 6 to remove all typedefs in include/net/sctp/structs.h, command.h
and sm.h.
Just as the part 1-5, No any code's logic would be changed in these patches,
only cleaning up.
Note that this is the last part for this typedef cleaning up. after this
patchset, no more inappropriate typedefs in sctp. It's also to tidy some
codes when removing them, like fixing many indents, reodering some local
params, especially in the last 2 patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:58 +0000 (10:23 +0800)]
sctp: fix some indents in sm_make_chunk.c
There are some bad indents of functions' defination in sm_make_chunk.c.
They have been there since beginning, it was probably caused by that
the typedef sctp_chunk_t was replaced with struct sctp_chunk.
So it's the best time to fix them in this patchset, it's also to fix
some bad indents in other functions' defination in sm_make_chunk.c.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:57 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_disposition_t
This patch is to remove the typedef sctp_disposition_t, and
replace with enum sctp_disposition in the places where it's
using this typedef.
It's also to fix the indent for many functions' defination.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:56 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_sm_table_entry_t
This patch is to remove the typedef sctp_sm_table_entry_t, and
replace with struct sctp_sm_table_entry in the places where it's
using this typedef.
It is also to fix some indents.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:55 +0000 (10:23 +0800)]
sctp: remove the unused typedef sctp_sm_command_t
Remove this typedef including the struct, there is even no places
using it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:54 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_verb_t
This patch is to remove the typedef sctp_verb_t, and
replace with enum sctp_verb in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:53 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_arg_t
This patch is to remove the typedef sctp_arg_t, and
replace with union sctp_arg in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:52 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_cmd_seq_t
This patch is to remove the typedef sctp_cmd_seq_t, and
replace with struct sctp_cmd_seq in the places where it's
using this typedef.
Note that it doesn't fix many indents although it should,
as sctp_disposition_t's removal would mess them up again.
So better to fix them when removing sctp_disposition_t in
the later patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:51 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_cmd_t
This patch is to remove the typedef sctp_cmd_t, and
replace with enum sctp_cmd in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:50 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_socket_type_t
This patch is to remove the typedef sctp_socket_type_t, and
replace with enum sctp_socket_type in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:49 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_dbg_objcnt_entry_t
This patch is to remove the typedef sctp_dbg_objcnt_entry_t, and
replace with struct sctp_dbg_objcnt_entry in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:48 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_cmsgs_t
This patch is to remove the typedef sctp_cmsgs_t, and
replace with struct sctp_cmsgs in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:47 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_endpoint_type_t
This patch is to remove the typedef sctp_endpoint_type_t, and
replace with enum sctp_endpoint_type in the places where it's
using this typedef.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:46 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_sender_hb_info_t
This patch is to remove the typedef sctp_sender_hb_info_t, and
replace with struct sctp_sender_hb_info in the places where it's
using this typedef.
It is also to use sizeof(variable) instead of sizeof(type).
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 11 Aug 2017 02:23:45 +0000 (10:23 +0800)]
sctp: remove the unused typedef sctp_packet_phandler_t
Remove this function typedef, there is even no places
using it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Aug 2017 19:11:16 +0000 (12:11 -0700)]
Merge git://git./linux/kernel/git/davem/net
Mainline had UFO fixes, but UFO is removed in net-next so we
take the HEAD hunks.
Minor context conflict in bcmsysport statistics bug fix.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 10 Aug 2017 17:30:29 +0000 (10:30 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix handling of initial STATE message in TIPC, from Jon Paul Maloy.
2) Fix stats handling in bcm_sysport_get_stats(), from Florian
Fainelli.
3) Reject
16777215 VNI value in geneve_validate(), from Girish
Moodalbail.
4) Fix initial IGMP sysctl setting regression, from Nikolay Borisov.
5) Once a UFO fragmented frame is treated as UFO, we should continue
doing so. Likewise once a frame has been segmented, we should
continue doing that and not try to convert it to a UFO frame. From
Willem de Bruijn.
6) Test the AF_PACKET RX/TX ring pg_vec state under the socket lock to
prevent races. From Willem de Bruijn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
packet: fix tp_reserve race in packet_set_ring
udp: consistently apply ufo or fragmentation
net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
igmp: Fix regression caused by igmp sysctl namespace code.
geneve: maximum value of VNI cannot be used
net: systemport: Fix software statistics for SYSTEMPORT Lite
tipc: remove premature ESTABLISH FSM event at link synchronization
Willem de Bruijn [Thu, 10 Aug 2017 16:41:58 +0000 (12:41 -0400)]
packet: fix tp_reserve race in packet_set_ring
Updates to tp_reserve can race with reads of the field in
packet_set_ring. Avoid this by holding the socket lock during
updates in setsockopt PACKET_RESERVE.
This bug was discovered by syzkaller.
Fixes:
8913336a7e8d ("packet: add PACKET_RESERVE sockopt")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Thu, 10 Aug 2017 16:29:19 +0000 (12:29 -0400)]
udp: consistently apply ufo or fragmentation
When iteratively building a UDP datagram with MSG_MORE and that
datagram exceeds MTU, consistently choose UFO or fragmentation.
Once skb_is_gso, always apply ufo. Conversely, once a datagram is
split across multiple skbs, do not consider ufo.
Sendpage already maintains the first invariant, only add the second.
IPv6 does not have a sendpage implementation to modify.
A gso skb must have a partial checksum, do not follow sk_no_check_tx
in udp_send_skb.
Found by syzkaller.
Fixes:
e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Aug 2017 16:50:22 +0000 (09:50 -0700)]
Merge branch 'rtnetlink-fix-initial-rtnl-pushdown-fallout'
Florian Westphal says:
====================
rtnetlink: fix initial rtnl pushdown fallout
This series fixes various bugs and splats reported since the
allow-handler-to-run-with-no-rtnl series went in.
Last patch adds a script that can be used to add further
tests in case more bugs are reported.
In case you prefer reverting the original series instead of
fixing fallout I can resend this patch on its own.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Aug 2017 14:53:02 +0000 (16:53 +0200)]
selftests: add rtnetlink test script
add a simple script to exercise some rtnetlink call paths, so KASAN,
lockdep etc. can yell at developer before patches are sent upstream.
This can be extended to also cover bond, team, vrf and the like.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Aug 2017 14:53:01 +0000 (16:53 +0200)]
rtnetlink: fallback to UNSPEC if current family has no doit callback
We need to use PF_UNSPEC in case the requested family has no doit
callback, otherwise this now fails with EOPNOTSUPP instead of running the
unspec doit callback, as before.
Fixes:
6853dd488119 ("rtnetlink: protect handler table with rcu")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Aug 2017 14:53:00 +0000 (16:53 +0200)]
rtnetlink: init handler refcounts to 1
If using CONFIG_REFCOUNT_FULL=y we get following splat:
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 0 PID: 304 at lib/refcount.c:152 refcount_inc+0x47/0x50
Call Trace:
rtnetlink_rcv_msg+0x191/0x260
...
This warning is harmless (0 is "no callback running", not "memory
was freed").
Use '1' as the new 'no handler is running' base instead of 0 to avoid
this.
Fixes:
019a316992ee ("rtnetlink: add reference counting to prevent module unload while dump is in progress")
Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Aug 2017 14:52:59 +0000 (16:52 +0200)]
rtnetlink: switch rtnl_link_get_slave_info_data_size to rcu
David Ahern reports following splat:
RTNL: assertion failed at net/core/dev.c (5717)
netdev_master_upper_dev_get+0x5f/0x70
if_nlmsg_size+0x158/0x240
rtnl_calcit.isra.26+0xa3/0xf0
rtnl_link_get_slave_info_data_size currently assumes RTNL protection, but
there appears to be no hard requirement for this, so use rcu instead.
At the time of this writing, there are three 'get_slave_size' callbacks
(now invoked under rcu): bond_get_slave_size, vrf_get_slave_size and
br_port_get_slave_size, all return constant only (i.e. they don't sleep).
Fixes:
6853dd488119 ("rtnetlink: protect handler table with rcu")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Aug 2017 14:52:58 +0000 (16:52 +0200)]
rtnetlink: do not use RTM_GETLINK directly
Userspace sends RTM_GETLINK type, but the kernel substracts
RTM_BASE from this, i.e. 'type' doesn't contain RTM_GETLINK
anymore but instead RTM_GETLINK - RTM_BASE.
This caused the calcit callback to not be invoked when it
should have been (and vice versa).
While at it, also fix a off-by one when checking family index. vs
handler array size.
Fixes:
e1fa6d216dd ("rtnetlink: call rtnl_calcit directly")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 10 Aug 2017 14:52:57 +0000 (16:52 +0200)]
rtnetlink: use rcu_dereference_raw to silence rcu splat
Ido reports a rcu splat in __rtnl_register.
The splat is correct; as rtnl_register doesn't grab any logs
and doesn't use rcu locks either. It has always been like this.
handler families are not registered in parallel so there are no
races wrt. the kmalloc ordering.
The only reason to use rcu_dereference in the first place was to
avoid sparse from complaining about this.
Thus this switches to _raw() to not have rcu checks here.
The alternative is to add rtnl locking to register/unregister,
however, I don't see a compelling reason to do so as this has been
lockless for the past twenty years or so.
Fixes:
6853dd4881 ("rtnetlink: protect handler table with rcu")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 10 Aug 2017 16:36:06 +0000 (09:36 -0700)]
Merge git://git./linux/kernel/git/davem/sparc
Pull sparc updates from David Miller:
1) Recognize M8 cpus, just basic chip ID matching, from Allen Pais.
2) Prevent crashes when bringing up sunvdc virtual block devices in
some environments. From Jim Quigley.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain
sparc64: Increase max_phys_bits to 51 and VA bits to 53 for M8.
sparc64: recognize and support sparc M8 cpu type
sparc64: properly name the cpu constants
John Crispin [Thu, 10 Aug 2017 08:09:03 +0000 (10:09 +0200)]
net: core: fix compile error inside flow_dissector due to new dsa callback
The following error was introduced by
commit
43e665287f93 ("net-next: dsa: fix flow dissection")
due to a missing #if guard
net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:448:18: error: 'struct net_device' has no member named 'dsa_ptr'
ops = skb->dev->dsa_ptr->tag_ops;
^
make[3]: *** [net/core/flow_dissector.o] Error 1
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Aug 2017 05:51:47 +0000 (22:51 -0700)]
Merge branch 'dsa-flow-dissection'
John Crispin says:
====================
net-next: dsa: fix flow dissection
RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.
Changes since RFC:
* use a callback instead of static values
* add cover letter
====================
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Wed, 9 Aug 2017 12:41:19 +0000 (14:41 +0200)]
net-next: dsa: fix flow dissection
RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.
Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Wed, 9 Aug 2017 12:41:18 +0000 (14:41 +0200)]
net-next: tag_mtk: add flow_dissect callback to the ops struct
The MT7530 inserts the 4 magic header in between the 802.3 address and
protocol field. The patch implements the callback that can be called by
the flow dissector to figure out the real protocol and offset of the
network header. With this patch applied we can properly parse the packet
and thus make hashing function properly.
Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Wed, 9 Aug 2017 12:41:17 +0000 (14:41 +0200)]
net-next: dsa: add flow_dissect callback to struct dsa_device_ops
When the flow dissector first sees packets coming in on a DSA devices the
802.3 header wont be located where the code expects it to be as the tag
is still present. Adding this new callback allows a DSA device to provide a
new function that the flow_dissector can use to get the correct protocol
and offset of the network header.
Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Wed, 9 Aug 2017 12:41:16 +0000 (14:41 +0200)]
net-next: dsa: move struct dsa_device_ops to the global header file
We need to access this struct from within the flow_dissector to fix
dissection for packets coming in on DSA devices.
Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 9 Aug 2017 10:15:19 +0000 (18:15 +0800)]
net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
Commit
55917a21d0cc ("netfilter: x_tables: add context to know if
extension runs from nft_compat") introduced a member nft_compat to
xt_tgchk_param structure.
But it didn't set it's value for ipt_init_target. With unexpected
value in par.nft_compat, it may return unexpected result in some
target's checkentry.
This patch is to set all it's fields as 0 and only initialize the
non-zero fields in ipt_init_target.
v1->v2:
As Wang Cong's suggestion, fix it by setting all it's fields as
0 and only initializing the non-zero fields.
Fixes:
55917a21d0cc ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Borisov [Wed, 9 Aug 2017 11:38:04 +0000 (14:38 +0300)]
igmp: Fix regression caused by igmp sysctl namespace code.
Commit
dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
function is only called as part of per-namespace initialization, only if
CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is
compiled out, casuing the igmp pernet ops to not be registerd and those sysctl
being left initialized with 0. However, there are certain functions, such as
ip_mc_join_group which are always compiled and make use of some of those
sysctls. Let's do a partial revert of the aforementioned commit and move the
sysctl initialization into inet_init_net, that way they will always have
sane values.
Fixes:
dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
Reported-by: Gerardo Exequiel Pozzi <vmlinuz386@gmail.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Aug 2017 05:45:36 +0000 (22:45 -0700)]
Merge branch 'mediatek-bring-up-QDMA-RX-ring-0'
John Crispin says:
====================
net-next: mediatek: bring up QDMA RX ring 0
The MT7623 has several DMA rings. Inside the SW path, the core will use
the PDMA when receiving traffic. While bringing up the HW path we noticed
that the PPE requires the QDMA RX to also be brought up as it uses this
ring internally for its flow scheduling.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Wed, 9 Aug 2017 10:09:32 +0000 (12:09 +0200)]
net-next: mediatek: bring up QDMA RX ring 0
This patch is in preparation for adding HW flow and QoS offloading. For
those features to work, the driver needs to bring up the first QDMA RX
ring. This ring is used by the PPE offloading HW.
Signed-off-by: John Crisp in <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Wed, 9 Aug 2017 10:09:31 +0000 (12:09 +0200)]
net-next: mediatek: fix typos inside the header file
Trivial patch fixing 2 typos.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Wed, 9 Aug 2017 09:32:08 +0000 (15:02 +0530)]
net: atm: make atmdev_ops const
Make these const as they are only stored in the ops field of a atm_dev
structure, which is const.
Done using Coccinelle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Wed, 9 Aug 2017 09:19:15 +0000 (14:49 +0530)]
atm: make atmdev_ops const
Make these structures const as they are either passed to the function
atm_dev_register having the corresponding argument as const or stored in
the ops field of a atm_dev structure, which is also const.
Done using Coccinelle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bhumika Goyal [Wed, 9 Aug 2017 05:04:15 +0000 (10:34 +0530)]
net: dsa: make dsa_switch_ops const
Make these structures const as they are only stored in the ops field of
a dsa_switch structure, which is const.
Done using Coccinelle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Wed, 9 Aug 2017 02:34:28 +0000 (19:34 -0700)]
liquidio: napi cleanup
Disable napi when interface is going down.
Delete napi when destroying the interface.
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Girish Moodalbail [Wed, 9 Aug 2017 00:26:24 +0000 (17:26 -0700)]
geneve: maximum value of VNI cannot be used
Geneve's Virtual Network Identifier (VNI) is 24 bit long, so the range
of values for it would be from 0 to
16777215 (2^24 -1). However, one
cannot create a geneve device with VNI set to
16777215. This patch fixes
this issue.
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 8 Aug 2017 21:51:02 +0000 (15:51 -0600)]
net: ipv6: lower ndisc notifier priority below addrconf
ndisc_notify is used to send unsolicited neighbor advertisements
(e.g., on a link up). Currently, the ndisc notifier is run before the
addrconf notifer which means NA's are not sent for link-local addresses
which are added by the addrconf notifier.
Fix by lowering the priority of the ndisc notifier. Setting the priority
to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before
the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 8 Aug 2017 21:45:09 +0000 (14:45 -0700)]
net: systemport: Fix software statistics for SYSTEMPORT Lite
With SYSTEMPORT Lite we have holes in our statistics layout that make us
skip over the hardware MIB counters, bcm_sysport_get_stats() was not
taking that into account resulting in reporting 0 for all SW-maintained
statistics, fix this by skipping accordingly.
Fixes:
44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Tue, 8 Aug 2017 20:23:56 +0000 (22:23 +0200)]
tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.
However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before initializing synchronization.
Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.
This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.
We fix this by removing the unnecessary event call.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Fontenot [Tue, 8 Aug 2017 20:26:18 +0000 (15:26 -0500)]
ibmvnic: Correct 'unused variable' warning in build.
Commit
a248878d7a1d ("ibmvnic: Check for transport event on driver resume")
removed the loop to kick irqs on driver resume but didn't remove the now
unused loop variable 'i'.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Fontenot [Tue, 8 Aug 2017 20:24:05 +0000 (15:24 -0500)]
ibmvnic: Add netdev_dbg output for debugging
To ease debugging of the ibmvnic driver add a series of netdev_dbg()
statements to track driver status, especially during initialization,
removal, and resetting of the driver.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Fontenot [Tue, 8 Aug 2017 19:28:45 +0000 (14:28 -0500)]
ibmvnic: Clean up resources on probe failure
Ensure that any resources allocated during probe are released if the
probe of the driver fails.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jim Quigley [Fri, 21 Jul 2017 13:20:15 +0000 (09:20 -0400)]
sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain
Using mpgroup to define multiple paths for a virtual disk causes multiple
virtual-device-port ports to be created for that virtual device.
Each virtual-device-port port then gets a vdisk created for it by the Linux
sunvdc driver. As mpgroup is not supported by the Linux sunvdc driver it
cannot handle multiple ports for a single vdisk, leading to a kernel panic
at startup.
This fix prevents more than one vdisk per virtual-device-port being created
until full virtual disk multipathing (mpgroup) support is implemented.
Signed-off-by: Jim Quigley <Jim.Quigley@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Aug 2017 23:57:38 +0000 (16:57 -0700)]
Merge branch 'rtnetlink-allow-selected-handlers-to-run-without-rtnl'
Florian Westphal says:
====================
rtnetlink: allow selected handlers to run without rtnl
Changes since v1:
In patch 6, don't make ipv6 route handlers lockless, they all have
assumptions on rtnl being held. Other patches are unchanged.
The RTNL mutex is used to serialize both rtnetlink calls and
dump requests.
Its also used to protect other things such as the list of current
net namespaces.
Unfortunately RTNL mutex is a performance issue, e.g. a cpu adding an
ip address prevents other cpus from seemingly unrelated tasks such as
dumping tc classifiers or doing rtnetlink route lookups.
This patch set adds basic infrastructure to start pushing the rtnl lock
down to those places that need it, or even elide it entirely in some cases.
Subsystems can now indicate that their doit() callback can run without
RTNL mutex, such callbacks can then run in parallel.
This will obviously need a lot of followup work; all current
users need to be audited/changed to benefit from this.
Initial no-rtnl spot is netns new/getid.
We have various 'get' handlers that are also a tempting target,
however, several of these depend on rtnl mutex to prevent information
from changing while objects are being read by rtnl handlers; however,
it doesn't appear impossible to change this.
Dumps are another problem entirely, see
commit
2907c35ff64708065 ("net: hold rtnl again in dump callbacks"),
this patchset doesn't touch dump requests.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:53 +0000 (20:41 +0200)]
net: call newid/getid without rtnl mutex held
Both functions take nsid_lock and don't rely on rtnl lock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:52 +0000 (20:41 +0200)]
rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED
Allow callers to tell rtnetlink core that its doit callback
should be invoked without holding rtnl mutex.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:51 +0000 (20:41 +0200)]
rtnetlink: protect handler table with rcu
Note that netlink dumps still acquire rtnl mutex via the netlink
dump infrastructure.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:50 +0000 (20:41 +0200)]
rtnetlink: small rtnl lock pushdown
instead of rtnl lock/unload at the top level, push it down
to the called function.
This is just an intermediate step, next commit switches protection
of the rtnl_link ops table to rcu, in which case (for dumps) the
rtnl lock is acquired only by the netlink dumper infrastructure
(current lock/unlock/dump/lock/unlock rtnl sequence becomes
rcu lock/rcu unlock/dump).
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:49 +0000 (20:41 +0200)]
rtnetlink: add reference counting to prevent module unload while dump is in progress
I don't see what prevents rmmod (unregister_all is called) while a dump
is active.
Even if we'd add rtnl lock/unlock pair to unregister_all (as done here),
thats not enough either as rtnl_lock is released right before the dump
process starts.
So this adds a refcount:
* acquire rtnl mutex
* bump refcount
* release mutex
* start the dump
... and make unregister_all remove the callbacks (no new dumps possible)
and then wait until refcount is 0.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:48 +0000 (20:41 +0200)]
rtnetlink: make rtnl_register accept a flags parameter
This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.
This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 9 Aug 2017 18:41:47 +0000 (20:41 +0200)]
rtnetlink: call rtnl_calcit directly
There is only a single place in the kernel that regisers the "calcit"
callback (to determine min allocation for dumps).
This is in rtnetlink.c for PF_UNSPEC RTM_GETLINK.
The function that checks for calcit presence at run time will first check
the requested family (which will always fail for !PF_UNSPEC as no callsite
registers this), then falls back to checking PF_UNSPEC.
Therefore we can just check if type is RTM_GETLINK and then do a direct
call. Because of fallback to PF_UNSPEC all RTM_GETLINK types used this
regardless of family.
This has the advantage that we don't need to allocate space for
the function pointer for all the other families.
A followup patch will drop the calcit function pointer from the
rtnl_link callback structure.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Aug 2017 23:53:57 +0000 (16:53 -0700)]
Merge branch 'bpf-new-branches'
Daniel Borkmann says:
====================
bpf: Add BPF_J{LT,LE,SLT,SLE} instructions
This set adds BPF_J{LT,LE,SLT,SLE} instructions to the BPF
insn set, interpreter, JIT hardening code and all JITs are
also updated to support the new instructions. Basic idea is
to reduce register pressure by avoiding BPF_J{GT,GE,SGT,SGE}
rewrites. Removing the workaround for the rewrites in LLVM,
this can result in shorter BPF programs, less stack usage
and less verification complexity. First patch provides some
more details on rationale and integration.
Thanks a lot!
v1 -> v2:
- Reworded commit msg in patch 1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:40:03 +0000 (01:40 +0200)]
bpf: add test cases for new BPF_J{LT, LE, SLT, SLE} instructions
Add test cases to the verifier selftest suite in order to verify that
i) direct packet access, and ii) dynamic map value access is working
with the changes related to the new instructions.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:40:02 +0000 (01:40 +0200)]
bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier
Enable the newly added jump opcodes, main parts are in two
different areas, namely direct packet access and dynamic map
value access. For the direct packet access, we now allow for
the following two new patterns to match in order to trigger
markings with find_good_pkt_pointers():
Variant 1 (access ok when taking the branch):
0: (61) r2 = *(u32 *)(r1 +76)
1: (61) r3 = *(u32 *)(r1 +80)
2: (bf) r0 = r2
3: (07) r0 += 8
4: (ad) if r0 < r3 goto pc+2
R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
R3=pkt_end R10=fp
5: (b7) r0 = 0
6: (95) exit
from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
7: (71) r0 = *(u8 *)(r2 +0)
8: (05) goto pc-4
5: (b7) r0 = 0
6: (95) exit
processed 11 insns, stack depth 0
Variant 2 (access ok on fall-through):
0: (61) r2 = *(u32 *)(r1 +76)
1: (61) r3 = *(u32 *)(r1 +80)
2: (bf) r0 = r2
3: (07) r0 += 8
4: (bd) if r3 <= r0 goto pc+1
R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
R3=pkt_end R10=fp
5: (71) r0 = *(u8 *)(r2 +0)
6: (b7) r0 = 1
7: (95) exit
from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
6: (b7) r0 = 1
7: (95) exit
processed 10 insns, stack depth 0
The above two basically just swap the branches where we need
to handle an exception and allow packet access compared to the
two already existing variants for find_good_pkt_pointers().
For the dynamic map value access, we add the new instructions
to reg_set_min_max() and reg_set_min_max_inv() in order to
learn bounds. Verifier test cases for both are added in a
follow-up patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:40:01 +0000 (01:40 +0200)]
bpf, nfp: implement jiting of BPF_J{LT,LE}
This work implements jiting of BPF_J{LT,LE} instructions with
BPF_X/BPF_K variants for the nfp eBPF JIT. The two BPF_J{SLT,SLE}
instructions have not been added yet given BPF_J{SGT,SGE} are
not supported yet either.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:40:00 +0000 (01:40 +0200)]
bpf, ppc64: implement jiting of BPF_J{LT, LE, SLT, SLE}
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the ppc64 eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:39:59 +0000 (01:39 +0200)]
bpf, s390x: implement jiting of BPF_J{LT, LE, SLT, SLE}
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the s390x eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:39:58 +0000 (01:39 +0200)]
bpf, sparc64: implement jiting of BPF_J{LT, LE, SLT, SLE}
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the sparc64 eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:39:57 +0000 (01:39 +0200)]
bpf, arm64: implement jiting of BPF_J{LT, LE, SLT, SLE}
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the arm64 eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:39:56 +0000 (01:39 +0200)]
bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE}
This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the x86_64 eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Aug 2017 23:39:55 +0000 (01:39 +0200)]
bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Aug 2017 23:49:17 +0000 (16:49 -0700)]
Merge branch 'net-zerocopy-fixes'
Willem de Bruijn says:
====================
net: zerocopy fixes
Fix two issues introduced in the msg_zerocopy patchset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 9 Aug 2017 23:09:44 +0000 (19:09 -0400)]
sock: fix zerocopy_success regression with msg_zerocopy
Do not use uarg->zerocopy outside msg_zerocopy. In other paths the
field is not explicitly initialized and aliases another field.
Those paths have only one reference so do not need this intermediate
variable. Call uarg->callback directly.
Fixes:
1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 9 Aug 2017 23:09:43 +0000 (19:09 -0400)]
sock: fix zerocopy panic in mem accounting
Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info
that has initialized its field uarg->mmp.
Before this patch, a vhost-net with experimental_zcopytx can crash in
mm_unaccount_pinned_pages
sock_zerocopy_put
skb_zcopy_clear
skb_release_data
Only sock_zerocopy_alloc initializes this field. Move the unaccount
call from generic sock_zerocopy_put to its specific callback
sock_zerocopy_callback.
Fixes:
a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Aug 2017 23:47:19 +0000 (16:47 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
1GbE Intel Wired LAN Driver Updates 2017-08-08
This series contains updates to e1000e and igb/igbvf.
Gangfeng Huang fixes an issue with receive network flow classification,
where igb_nfc_filter_exit() was not being called in igb_down() which
would cause the filter tables to "fill up" if a user where to change
the adapter settings (such as speed) which requires a reset of the
adapter.
Cliff Spradlin fixes a timestamping issue, where igb was allowing requests
for hardware timestamping even if it was not configured for hardware
transmit timestamping.
Corinna Vinschen removes the error message that there was an "unexpected
SYS WRAP", when it is actually expected. So remove the message to not
confuse users.
Greg Edwards provides several patches for the mailbox interface between
the PF and VF drivers. Added a mailbox unlock method to be used to unlock
the PF/VF mailbox by the PF. Added a lock around the VF mailbox ops to
prevent the VF from sending another message while the PF is still
processing the previous message. Fixed a "scheduling while atomic" issue
by changing msleep() to mdelay().
Sasha adds support for the next LOM generations i219 (v8 & v9) which
will be available in the next Intel client platform IceLake.
John Linville adds support for a Broadcom PHY to the igb driver, since
there are designs out in the world which use the igb MAC and a third
party PHY. This allows the driver to load and function as expected on
these designs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Aug 2017 23:28:45 +0000 (16:28 -0700)]
Merge git://git./linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.
The TCP conflict was a case of local variables in a function
being removed from both net and net-next.
In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 9 Aug 2017 21:30:34 +0000 (14:30 -0700)]
Merge tag 'pinctrl-v4.13-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"These are the pin control fixes I have gathered since the return from
my vacation. They boiled in -next a while so let's get them in.
Apart from the documentation build it is purely driver fixes. Which is
nice. The Intel fixes seem kind of important.
- Fix the documentation build as the docs were moved
- Correct the UART pin list on the Intel Merrifield
- Fix pin assignment and number of pins on the Marvell Armada 37xx
pin controller
- Cover the Setzer models in the Chromebook DMI quirk in the Intel
cheryview driver so they start working
- Add the missing "sim" function to the sunxi driver
- Fix USB pin definitions on Uniphier Pro4
- Smatch fix for invalid reference in the zx pin control driver"
* tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: generic: update references to Documentation/pinctrl.txt
pinctrl: intel: merrifield: Correct UART pin lists
pinctrl: armada-37xx: Fix number of pin in south bridge
pinctrl: armada-37xx: Fix the pin 23 on south bridge
pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirk
pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver
pinctrl: uniphier: fix USB3 pin assignment for Pro4
pinctrl: zte: fix dereference of 'data' in zx_set_mux()
Mel Gorman [Wed, 9 Aug 2017 07:27:11 +0000 (08:27 +0100)]
futex: Remove unnecessary warning from get_futex_key
Commit
65d8fc777f6d ("futex: Remove requirement for lock_page() in
get_futex_key()") removed an unnecessary lock_page() with the
side-effect that page->mapping needed to be treated very carefully.
Two defensive warnings were added in case any assumption was missed and
the first warning assumed a correct application would not alter a
mapping backing a futex key. Since merging, it has not triggered for
any unexpected case but Mark Rutland reported the following bug
triggering due to the first warning.
kernel BUG at kernel/futex.c:679!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted
4.13.0-rc3-00020-g307fec773ba3 #3
Hardware name: linux,dummy-virt (DT)
task:
ffff80001e271780 task.stack:
ffff000010908000
PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
pc : [<
ffff00000821ac14>] lr : [<
ffff00000821ac14>] pstate:
80000145
The fact that it's a bug instead of a warning was due to an unrelated
arm64 problem, but the warning itself triggered because the underlying
mapping changed.
This is an application issue but from a kernel perspective it's a
recoverable situation and the warning is unnecessary so this patch
removes the warning. The warning may potentially be triggered with the
following test program from Mark although it may be necessary to adjust
NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
system.
#include <linux/futex.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>
#define NR_FUTEX_THREADS 16
pthread_t threads[NR_FUTEX_THREADS];
void *mem;
#define MEM_PROT (PROT_READ | PROT_WRITE)
#define MEM_SIZE 65536
static int futex_wrapper(int *uaddr, int op, int val,
const struct timespec *timeout,
int *uaddr2, int val3)
{
syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}
void *poll_futex(void *unused)
{
for (;;) {
futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
}
}
int main(int argc, char *argv[])
{
int i;
mem = mmap(NULL, MEM_SIZE, MEM_PROT,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
printf("Mapping @ %p\n", mem);
printf("Creating futex threads...\n");
for (i = 0; i < NR_FUTEX_THREADS; i++)
pthread_create(&threads[i], NULL, poll_futex, NULL);
printf("Flipping mapping...\n");
for (;;) {
mmap(mem, MEM_SIZE, MEM_PROT,
MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}
return 0;
}
Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 9 Aug 2017 20:21:28 +0000 (13:21 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"The main thing is to allow empty id_tables for ACPI to make some
drivers get probed again. It looks a bit bigger than usual because it
needs some internal renaming, too.
Other than that, there is a fix for broken DSTDs, a super simple
enablement for ARM MPS, and two documentation fixes which I'd like to
see in v4.13 already"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: rephrase explanation of I2C_CLASS_DEPRECATED
i2c: allow i2c-versatile for ARM MPS platforms
i2c: designware: Some broken DSTDs use 1MiHz instead of 1MHz
i2c: designware: Print clock freq on invalid clock freq error
i2c: core: Allow empty id_table in ACPI case as well
i2c: mux: pinctrl: mention correct module name in Kconfig help text