David S. Miller [Mon, 2 Mar 2015 21:43:46 +0000 (16:43 -0500)]
Merge branch 'neigh_cleanups'
Eric W. Biederman says:
====================
Neighbour table and ax25 cleanups
While looking at the neighbour table to see what it would take to allow
using next hops in a different address family than the current packets,
I found a partial resolution for my issues and stumbled upon some
work that makes the neighbour table code easier to understand and
maintain.
Long ago, in a much younger kernel, ax25 found a hack to use
dev_rebuild_header to transmit its packets instead of going through
what today is ndo_start_xmit.
When the neighbour table was rewritten into its current form the ax25
code was such a challenge that arp_broken_ops appeared in arp.c and
neigh_compat_output appeared in neighbour.c to keep the ax25 hack alive.
With a little bit of work I was able to remove some of the hack that
is the ax25 transmit path for ip packets and to isolate what remains
into a slightly more readable piece of code in ax25_ip.c, removing the
need for the generic code to worry about ax25 special cases.
After cleaning up the old ax25 hacks I also performed a little bit of
work on neigh_resolve_output to remove the need for a dst entry and to
ensure cached headers get a deterministic protocol value in their cached
header. This guarantees that a cached header will not be different
depending on which protocol of packet is transmitted, and it allows
packets to be transmitted that don't have a dst entry. There remains
a small amount of code that takes advantage of when packets have a dst
entry but that is something different.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:14:14 +0000 (00:14 -0600)]
neigh: Don't require a dst in neigh_resolve_output
Having a dst helps a little bit for teql but is fundamentally
unnecessary, and there are code paths where a dst is not available in
which it would be nice to use the neighbour cache.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:13:22 +0000 (00:13 -0600)]
neigh: Don't require dst in neigh_hh_init
- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev
This results in a neigh_hh_init that will cache the same values
regardless of the packets flowing through it.
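A minimal user-space sketch of the idea, not the kernel code: the cached
header is built only from state owned by the neighbour entry (its table's
protocol and its device), so it cannot vary with the packet that triggers
the caching. All names here (demo_tbl, demo_dev, demo_neigh, demo_hh_init)
and the 14-byte Ethernet-style layout are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the neighbour table, device and entry. */
struct demo_tbl { uint16_t protocol; };   /* e.g. 0x0800 for IPv4 */
struct demo_dev { uint8_t addr[6]; };

struct demo_neigh {
	const struct demo_tbl *tbl;
	const struct demo_dev *dev;
	uint8_t ha[6];    /* resolved hardware address */
	uint8_t hh[14];   /* cached Ethernet-style header */
	int hh_len;       /* 0 until the header is cached */
};

/* Build the cached header once, without looking at any packet or dst. */
static void demo_hh_init(struct demo_neigh *n)
{
	if (n->hh_len)
		return;                               /* already cached */
	memcpy(n->hh, n->ha, 6);                      /* destination */
	memcpy(n->hh + 6, n->dev->addr, 6);           /* source, from the entry's device */
	n->hh[12] = (uint8_t)(n->tbl->protocol >> 8); /* protocol, from the table */
	n->hh[13] = (uint8_t)(n->tbl->protocol & 0xff);
	n->hh_len = 14;
}
```

Because nothing in demo_hh_init depends on a packet, two callers sending
different kinds of traffic through the same entry see the same cached bytes.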
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:12:05 +0000 (00:12 -0600)]
arp: Kill arp_find
There are no more callers so kill this function.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:11:09 +0000 (00:11 -0600)]
net: Kill dev_rebuild_header
Now that there are no more users, kill dev_rebuild_header and all of its
implementations.
This is long overdue.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:09:42 +0000 (00:09 -0600)]
ax25: Stop depending on arp_find
Have ax25_neigh_output perform ordinary arp resolution before calling
ax25_neigh_xmit.
Call dev_hard_header in ax25_neigh_output with a destination address so
it will not fail, and the destination mac address will not need to be
set in ax25_neigh_xmit.
Remove arp_find from ax25_neigh_xmit (the ordinary arp resolution added
to ax25_neigh_output removes the need for calling arp_find).
Document how close ax25_neigh_output is to neigh_resolve_output.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:08:43 +0000 (00:08 -0600)]
ax25: Stop calling/abusing dev_rebuild_header
- Rename ax25_rebuild_header to ax25_neigh_xmit and call it from
ax25_neigh_output directly. The rename is to make it clear
that this is not a rebuild_header operation.
- Remove ax25_rebuild_header from ax25_header_ops.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:07:37 +0000 (00:07 -0600)]
neigh: Move neigh_compat_output into ax25_ip.c
The only caller is now ax25_neigh_construct, so move
neigh_compat_output into ax25_ip.c, make it static, and rename it
ax25_neigh_output.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:06:31 +0000 (00:06 -0600)]
arp: Remove special case to give AX25 it's open arp operations.
The special case has been pushed out into ax25_neigh_construct so there
is no need to keep this code in arp.c
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:05:28 +0000 (00:05 -0600)]
ax25: Refactor to use private neighbour operations.
AX25 already has its own private arp cache operations to isolate
its abuse of dev_rebuild_header to transmit packets. Add a function
ax25_neigh_construct that will allow all of the ax25 devices to
force using these operations, so that the generic arp code does
not need to.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:04:31 +0000 (00:04 -0600)]
ax25: Make ax25_header and ax25_rebuild_header static
The only user is in ax25_ip.c so stop exporting these functions.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:03:45 +0000 (00:03 -0600)]
ax25/6pack: Replace sp_header_ops with ax25_header_ops
The two sets of header operations are functionally identical; remove
the duplicate definition.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:03:02 +0000 (00:03 -0600)]
ax25/kiss: Replace ax_header_ops with ax25_header_ops
The two sets of header operations are functionally identical; remove the
duplicate definition.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:02:19 +0000 (00:02 -0600)]
rose: Transmit packets in rose_xmit not rose_rebuild_header
Patterned after the similar code in net/rom, this turns out
to be a trivial, obviously correct transformation.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 06:01:30 +0000 (00:01 -0600)]
rose: Set the destination address in rose_header
Not setting the destination address is a bug that I suspect causes no
problems today, as only the arp code seems to call dev_hard_header and
the description I have of rose is that it is expected to be used with a
static neighbour table.
I have derived the offset and the length of the rose destination address
from rose_rebuild_header where arp_find calls neigh_ha_snapshot to set
the destination address.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Mon, 2 Mar 2015 05:59:57 +0000 (23:59 -0600)]
ax25: In ax25_rebuild_header add missing kfree_skb
In the unlikely (impossible?) event that we attempt to transmit
an ax25 packet over a non-ax25 device free the skb so we don't
leak it.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 2 Mar 2015 14:21:55 +0000 (15:21 +0100)]
ebpf: move CONFIG_BPF_SYSCALL-only function declarations
Masami noted that it would be better to hide the remaining CONFIG_BPF_SYSCALL-only
function declarations within the BPF header ifdef, w/o else path dummy alternatives
since these functions are not supposed to have a user outside of CONFIG_BPF_SYSCALL.
Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reference: http://article.gmane.org/gmane.linux.kernel.api/8658
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 2 Mar 2015 19:55:05 +0000 (14:55 -0500)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
A small batch with accumulated updates in nf-next, mostly IPVS updates,
they are:
1) Add 64-bits stats counters to IPVS, from Julian Anastasov.
2) Move NETFILTER_XT_MATCH_ADDRTYPE out of NETFILTER_ADVANCED as docker
seems to require this, from Anton Blanchard.
3) Use boolean instead of numeric value in set_match_v*(), from
coccinelle via Fengguang Wu.
4) Allow rescheduling of new connections in IPVS when port reuse is
detected, from Marcelo Ricardo Leitner.
5) Add missing bits to support arptables extensions from nft_compat,
from Arturo Borrero.
Patrick is preparing a large batch to enhance the set infrastructure,
named expressions among other things, that should follow up soon after
this batch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 2 Mar 2015 11:25:51 +0000 (12:25 +0100)]
filter: refactor common filter attach code into __sk_attach_prog
Both sk_attach_filter() and sk_attach_bpf() are setting up sk_filter,
charging skmem and attaching it to the socket after we got the eBPF
prog up and ready. Let's refactor that into a common helper.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 2 Mar 2015 19:47:12 +0000 (14:47 -0500)]
Merge branch 'for-upstream' of git://git./linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-03-02
Here's the first bluetooth-next pull request targeting the 4.1 kernel:
- ieee802154/6lowpan cleanups
- SCO routing to host interface support for the btmrvl driver
- AMP code cleanups
- Fixes to AMP HCI init sequence
- Refactoring of the HCI callback mechanism
- Added shutdown routine for Intel controllers in the btusb driver
- New config option to enable/disable Bluetooth debugfs information
- Fix for early data reception on L2CAP fixed channels
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 2 Mar 2015 18:06:38 +0000 (13:06 -0500)]
Merge branch 'sendmsg_recvmsg_iocb_removal'
Ying Xue says:
====================
net: Remove iocb argument from sendmsg and recvmsg
Currently there is only one user of the iocb argument: TIPC, whose
sendmsg() instances use it. Meanwhile, no recvmsg() instance uses the
iocb argument. Therefore, if we eliminate the weird usage of the iocb
argument from TIPC, the iocb argument can be removed from all sendmsg()
and recvmsg() instances of the whole networking stack.
Reference:
https://patchwork.ozlabs.org/patch/433960/
Changes:
v2:
* Fix compile errors in the DCCP module pointed out by David
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 2 Mar 2015 07:37:48 +0000 (15:37 +0800)]
net: Remove iocb argument from sendmsg and recvmsg
Now that TIPC no longer depends on the iocb argument in its internal
implementations of the sendmsg() and recvmsg() hooks defined in the
proto structure, no user of the iocb argument remains. We can
therefore drop the redundant iocb argument from all implementations
of sendmsg() and recvmsg() in the entire networking stack.
Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 2 Mar 2015 07:37:47 +0000 (15:37 +0800)]
tipc: Don't use iocb argument in socket layer
Currently the iocb argument is used to identify whether or not the
socket lock is held before tipc_sendmsg()/tipc_send_stream() is called.
But this usage prevents the iocb argument from being dropped from
sendmsg() at the common socket layer. Therefore, this commit introduces
two new functions called __tipc_sendmsg() and __tipc_send_stream().
They assume that their callers have taken the socket lock, thereby
avoiding the weird usage of the iocb argument.
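The split described above follows a common pattern: an outer function that
takes the lock and an inner one that assumes it is already held. A minimal
user-space sketch, with a toy lock counter standing in for the socket lock;
demo_sock, demo_sendmsg and demo_sendmsg_locked are hypothetical names, and
the inner function is not given a double-underscore prefix because such
identifiers are reserved outside the kernel.

```c
#include <stddef.h>

/* Hypothetical socket with a toy lock; lock_depth counts holders. */
struct demo_sock {
	int lock_depth;
	size_t sent;
};

static void demo_lock(struct demo_sock *sk)   { sk->lock_depth++; }
static void demo_unlock(struct demo_sock *sk) { sk->lock_depth--; }

/* Inner variant: the caller must already hold the socket lock. */
static int demo_sendmsg_locked(struct demo_sock *sk, size_t len)
{
	if (sk->lock_depth != 1)
		return -1;   /* lock not held exactly once: caller bug */
	sk->sent += len;
	return (int)len;
}

/* Outer variant: takes the lock, delegates, then releases it. */
static int demo_sendmsg(struct demo_sock *sk, size_t len)
{
	int ret;

	demo_lock(sk);
	ret = demo_sendmsg_locked(sk, len);
	demo_unlock(sk);
	return ret;
}
```

A path that already holds the lock calls the inner variant directly, so no
extra argument is needed to tell the two situations apart.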
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arturo Borrero [Mon, 16 Feb 2015 10:32:28 +0000 (11:32 +0100)]
netfilter: nft_compat: add support for arptables extensions
This patch adds support to arptables extensions from nft_compat.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Mon, 2 Mar 2015 05:19:35 +0000 (00:19 -0500)]
Merge branch 'dropcount'
Eyal Birger says:
====================
net: move skb->dropcount to skb->cb[]
Commit
977750076d98 ("af_packet: add interframe drop cmsg (v6)")
unionized skb->mark and skb->dropcount in order to allow recording
of the socket drop count while maintaining struct sk_buff size.
skb->dropcount was introduced since there was no available room
in skb->cb[] in packet sockets. However, its introduction led to
the inability to export skb->mark to userspace.
Aliasing skb->priority instead of skb->mark was considered.
However, that would lead to the inability to export skb->priority
to userspace if desired. Such a change may also lead to hard-to-find
issues, as skb->priority is assumed to be alias free and, as noted
by Shmulik Ladkani, is not 'naturally orthogonal' with other skb
fields.
This patch series follows the suggestions made by Eric Dumazet moving
the dropcount metric to skb->cb[], eliminating this problem
at the expense of 4 bytes less in skb->cb[] for protocol families
using it.
The patch series includes compaction of the bluetooth and packet
use of skb->cb[] as well as the infrastructure for placing dropcount
in skb->cb[].
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:31 +0000 (14:58 +0200)]
net: move skb->dropcount to skb->cb[]
Commit
977750076d98 ("af_packet: add interframe drop cmsg (v6)")
unionized skb->mark and skb->dropcount in order to allow recording
of the socket drop count while maintaining struct sk_buff size.
skb->dropcount was introduced since there was no available room
in skb->cb[] in packet sockets. However, its introduction led to
the inability to export skb->mark, or any other aliased field to
userspace if so desired.
Moving the dropcount metric to skb->cb[] eliminates this problem
at the expense of 4 bytes less in skb->cb[] for protocol families
using it.
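A rough user-space sketch of keeping a drop counter in the last bytes of a
per-packet control buffer, leaving the front of the buffer to the protocol.
The 48-byte buffer size and every demo_ name are assumptions for
illustration; this is not the kernel's definition.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_CB_SIZE 48   /* assumed size of the control buffer */

struct demo_skb {
	unsigned char cb[DEMO_CB_SIZE];
};

/* Socket-layer data kept in the tail of cb[]. */
struct demo_sock_cb {
	uint32_t dropcount;
};

#define DEMO_SOCK_CB_OFFSET (DEMO_CB_SIZE - sizeof(struct demo_sock_cb))

/* Store the drop count in the reserved tail bytes. */
static void demo_set_dropcount(struct demo_skb *skb, uint32_t drops)
{
	struct demo_sock_cb scb = { drops };

	memcpy(skb->cb + DEMO_SOCK_CB_OFFSET, &scb, sizeof(scb));
}

/* Read the drop count back from the tail bytes. */
static uint32_t demo_get_dropcount(const struct demo_skb *skb)
{
	struct demo_sock_cb scb;

	memcpy(&scb, skb->cb + DEMO_SOCK_CB_OFFSET, sizeof(scb));
	return scb.dropcount;
}
```

In this model a protocol family may use only the first
DEMO_SOCK_CB_OFFSET bytes, which is the "4 bytes less" trade-off the
message describes.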
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:30 +0000 (14:58 +0200)]
net: add common accessor for setting dropcount on packets
As part of an effort to move skb->dropcount to skb->cb[], use
a common function in order to set dropcount in struct sk_buff.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:29 +0000 (14:58 +0200)]
net: use common macro for assering skb->cb[] available size in protocol families
As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].
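One way such a check can look in plain C11 is a compile-time assertion
wrapped in a macro. This is a sketch under assumed sizes; DEMO_CB_SIZE,
DEMO_SOCK_CB_SIZE, DEMO_CB_CHECK_SIZE and struct demo_proto_cb are
hypothetical and do not reproduce the macro added by the patch.

```c
#define DEMO_CB_SIZE       48   /* assumed total control buffer size */
#define DEMO_SOCK_CB_SIZE  4    /* assumed bytes reserved at the tail */
#define DEMO_PROTO_CB_ROOM (DEMO_CB_SIZE - DEMO_SOCK_CB_SIZE)

/* Refuses to compile if a protocol's private data would not fit. */
#define DEMO_CB_CHECK_SIZE(type) \
	_Static_assert(sizeof(type) <= DEMO_PROTO_CB_ROOM, \
		       #type " does not fit in cb[]")

/* Hypothetical per-protocol ancillary data. */
struct demo_proto_cb {
	unsigned int origlen;
	unsigned char addr[20];
};

DEMO_CB_CHECK_SIZE(struct demo_proto_cb);
```

Growing struct demo_proto_cb past the available room then turns into a
build failure instead of silent corruption of the reserved tail bytes.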
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:28 +0000 (14:58 +0200)]
net: packet: use sockaddr_ll fields as storage for skb original length in recvmsg path
As part of an effort to move skb->dropcount to skb->cb[], 4 bytes
of additional room are needed in skb->cb[] in packet sockets.
Store the skb original length in the first two fields of sockaddr_ll
(sll_family and sll_protocol) as they can be derived from the skb when
needed.
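The aliasing trick can be sketched with a union: the length shares storage
with the two leading address fields, which are rebuilt from other
information just before the address is handed out. The trimmed-down
demo_sockaddr_ll layout and the helper name are assumptions for
illustration only.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down link-layer address. */
struct demo_sockaddr_ll {
	uint16_t sll_family;     /* derivable: fixed for the socket family */
	uint16_t sll_protocol;   /* derivable: known from the packet */
	int32_t  sll_ifindex;
	uint8_t  sll_addr[8];
};

/* origlen shares storage with sll_family and sll_protocol. */
union demo_packet_cb {
	uint32_t origlen;
	struct demo_sockaddr_ll ll;
};

/* Read the length first, then restore the two aliased address fields. */
static void demo_fill_addr(union demo_packet_cb *cb, uint16_t family,
			   uint16_t protocol, uint32_t *origlen_out)
{
	*origlen_out = cb->origlen;
	cb->ll.sll_family = family;
	cb->ll.sll_protocol = protocol;
}
```

The order matters: once the address fields are restored, the stored length
is gone, so it has to be read out beforehand.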
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:27 +0000 (14:58 +0200)]
net: rxrpc: change call to sock_recv_ts_and_drops() on rxrpc recvmsg to sock_recv_timestamp()
Commit
3b885787ea4112 ("net: Generalize socket rx gap / receive queue overflow cmsg")
allowed receiving packet dropcount information as a socket level option.
RXRPC sockets recvmsg function was changed to support this by calling
sock_recv_ts_and_drops() instead of sock_recv_timestamp().
However, protocol families wishing to receive dropcount should call
sock_queue_rcv_skb() or set the dropcount specifically (as done
in packet_rcv()). This was not done for rxrpc and thus this feature
never worked on these sockets.
Formalize this by not calling sock_recv_ts_and_drops() in rxrpc, as
part of an effort to move skb->dropcount into skb->cb[].
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:26 +0000 (14:58 +0200)]
net: bluetooth: compact struct bt_skb_cb by converting boolean fields to bit fields
Convert boolean fields incoming and req_start to bit fields and move
force_active in order to save space in bt_skb_cb in an effort to use
a portion of skb->cb[] for storing skb->dropcount.
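A small sketch of the space effect of turning byte-sized flags into
single-bit fields. The structs below are hypothetical, model only the two
flags, and are not the layout of bt_skb_cb. Bit-field layout is
implementation-defined (and uint8_t bit-fields are a common compiler
extension), so the only claim made is that the compact form is not larger.

```c
#include <stdbool.h>
#include <stdint.h>

/* Before: each flag occupies a full byte. */
struct demo_cb_bools {
	uint8_t  pkt_type;
	bool     incoming;
	bool     req_start;
	uint16_t expect;
};

/* After: both flags share one byte as single-bit fields. */
struct demo_cb_bits {
	uint8_t  pkt_type;
	uint8_t  incoming:1;
	uint8_t  req_start:1;
	uint16_t expect;
};
```

On common ABIs the first struct occupies 6 bytes and the second 4, since
the two flags plus padding collapse into a single byte.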
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Sun, 1 Mar 2015 12:58:25 +0000 (14:58 +0200)]
net: bluetooth: compact struct bt_skb_cb by inlining struct hci_req_ctrl
struct hci_req_ctrl is never used outside of struct bt_skb_cb;
Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing
the addition of more ancillary data.
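Why inlining a nested struct can free space: the nested struct carries its
own alignment padding, while inlined members can share padding with their
neighbours. The layouts below are hypothetical and only loosely modelled
on the description above; exact sizes depend on the ABI.

```c
#include <stdbool.h>
#include <stdint.h>

typedef void (*demo_complete_t)(int status);

/* Nested form: the pointer forces padding after the two small fields. */
struct demo_req_ctrl {
	bool start;
	uint8_t event;
	demo_complete_t complete;
};

struct demo_cb_nested {
	uint8_t  pkt_type;
	uint8_t  force_active;
	uint16_t opcode;
	struct demo_req_ctrl req;   /* starts on a pointer-aligned boundary */
};

/* Inlined form: the small fields pack next to each other. */
struct demo_cb_inlined {
	uint8_t  pkt_type;
	uint8_t  force_active;
	uint16_t opcode;
	bool     req_start;
	uint8_t  req_event;
	demo_complete_t req_complete;
};
```

On a typical 64-bit ABI the nested form takes 24 bytes and the inlined
form 16; on typical 32-bit ABIs the two come out the same size.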
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Farnsworth [Sun, 1 Mar 2015 10:54:39 +0000 (10:54 +0000)]
pppoe: Use workqueue to die properly when a PADT is received
When a PADT frame is received, the socket may not be in a good state to
close down the PPP interface. The current implementation handles this by
simply blocking all further PPP traffic, and hoping that the lack of traffic
will trigger the user to investigate.
Use schedule_work to get to a process context from which we clear down the
PPP interface, in a fashion analogous to hangup on a TTY-based PPP
interface. This causes pppd to disconnect immediately, and allows tools to
take immediate corrective action.
Note that pppd's rp_pppoe.so plugin has code in it to disable the session
when it disconnects; however, as a consequence of this patch, the session is
already disabled before rp_pppoe.so is asked to disable the session. The
result is a harmless error message:
Failed to disconnect PPPoE socket: 114 Operation already in progress
This message is safe to ignore, as long as the error is 114 Operation
already in progress; in that specific case, it means that the PPPoE session
has already been disabled before pppd tried to disable it.
Signed-off-by: Simon Farnsworth <simon@farnz.org.uk>
Tested-by: Dan Williams <dcbw@redhat.com>
Tested-by: Christoph Schulz <develop@kristov.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 26 Feb 2015 13:48:07 +0000 (14:48 +0100)]
bnx2: disable toggling of rxvlan if necessary
The bnx2 driver uses .ndo_fix_features to force Rx VLAN tag stripping
on when the card cannot disable it. The driver should instead remove
the NETIF_F_HW_VLAN_CTAG_RX flag from hw_features so that ethtool
reports the feature as fixed.
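A simplified model of why clearing a bit in hw_features makes it
non-toggleable: only bits present in the user-changeable mask follow the
request, all others keep their current state. The real feature negotiation
has more steps; the flag values, struct and function names below are
hypothetical.

```c
#include <stdint.h>

#define DEMO_F_VLAN_RX (1u << 0)   /* hypothetical feature bits */
#define DEMO_F_TSO     (1u << 1)

struct demo_dev {
	uint32_t features;      /* currently active features */
	uint32_t hw_features;   /* features the user may toggle */
};

/* Apply a user request; bits outside hw_features keep their value. */
static uint32_t demo_set_features(struct demo_dev *dev, uint32_t wanted)
{
	dev->features = (dev->features & ~dev->hw_features) |
			(wanted & dev->hw_features);
	return dev->features;
}
```

With DEMO_F_VLAN_RX left out of hw_features, a request to switch
everything off still leaves VLAN stripping enabled, and no fix-up hook is
needed to enforce that.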
Cc: Sony Chacko <sony.chacko@qlogic.com>
Cc: Dept-HSGLinuxNICDev@qlogic.com
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Chandran [Sun, 1 Mar 2015 06:08:03 +0000 (11:38 +0530)]
net: macb: Properly add DMACFG bit definitions
Add *_SIZE macros for the bits ENDIA_DESC and
ENDIA_PKT
Signed-off-by: Arun Chandran <achandran@mvista.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Chandran [Sun, 1 Mar 2015 06:08:02 +0000 (11:38 +0530)]
net: macb: Add on the fly CPU endianness detection
Program management descriptor's access mode according to the
dynamically detected CPU endianness.
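The message does not say how the driver detects the byte order, so the
following is only a generic run-time probe in portable C, followed by a
hypothetical use that selects a descriptor-access mode bit. DEMO_ENDIA_DESC
and its bit position are assumptions, not the hardware definition.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big-endian CPU, 0 on a little-endian one. */
static int demo_cpu_is_big_endian(void)
{
	const uint32_t probe = 0x01020304u;
	uint8_t first;

	memcpy(&first, &probe, 1);   /* lowest-addressed byte of the word */
	return first == 0x01;
}

#define DEMO_ENDIA_DESC (1u << 7)   /* hypothetical mode bit */

/* Set or clear the swap bit according to the detected byte order. */
static uint32_t demo_dmacfg(uint32_t dmacfg)
{
	if (demo_cpu_is_big_endian())
		return dmacfg | DEMO_ENDIA_DESC;
	return dmacfg & ~DEMO_ENDIA_DESC;
}
```

Probing at run time rather than at build time lets the same code follow
whichever byte order the CPU is actually running in.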
Signed-off-by: Arun Chandran <achandran@mvista.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Tested-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shrikrishna Khare [Sun, 1 Mar 2015 04:33:09 +0000 (20:33 -0800)]
Driver: Vmxnet3: Copy TCP header to mapped frame for IPv6 packets
Allows for packet parsing to be done by the fast path. This performance
optimization already exists for IPv4. Add similar logic for IPv6.
Signed-off-by: Amitabha Banerjee <banerjeea@vmware.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 1 Mar 2015 19:05:24 +0000 (14:05 -0500)]
Merge branch 'ebpf_support_for_cls_bpf'
Daniel Borkmann says:
====================
eBPF support for cls_bpf
This is the non-RFC version of my patchset posted before netdev01 [1]
conference. It contains a couple of eBPF cleanups and preparation
patches to get eBPF support into cls_bpf. The last patch adds the
actual support. I'll post the iproute2 parts after the kernel bits
are merged, an initial preview link to the code is mentioned in the
last patch.
Patch 4 and 5 were originally one patch, but I've split them into
two parts upon request as patch 4 only is also needed for Alexei's
tracing patches that go via tip tree.
Tested with tc and all in-kernel available BPF test suites.
I have configured and built LLVM with --enable-experimental-targets=BPF
but as Alexei put it, the plan is to get rid of the experimental
status in future [2].
Thanks a lot!
v1 -> v2:
- Removed arch patches from this series
- x86 is already queued in tip tree, under x86/mm
- arm64 just reposted directly to arm folks
- Rest is unchanged
[1] http://thread.gmane.org/gmane.linux.network/350191
[2] http://article.gmane.org/gmane.linux.kernel/1874969
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:48 +0000 (12:31 +0100)]
cls_bpf: add initial eBPF support for programmable classifiers
This work extends the "classic" BPF programmable tc classifier by
extending its scope also to native eBPF code!
This allows for user space to implement own custom, 'safe' C like
classifiers (or whatever other frontend language LLVM et al may
provide in future), that can then be compiled with the LLVM eBPF
backend to an eBPF elf file. The result of this can be loaded into
the kernel via iproute2's tc. In the kernel, they can be JITed on
major archs and thus run in native performance.
Simple, minimal toy example to demonstrate the workflow:
  #include <linux/ip.h>
  #include <linux/if_ether.h>
  #include <linux/bpf.h>
  #include "tc_bpf_api.h"
  __section("classify")
  int cls_main(struct sk_buff *skb)
  {
    return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
  }
  char __license[] __section("license") = "GPL";
The classifier can then be compiled into eBPF opcodes and loaded
via tc, for example:
  clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
  tc filter add dev em1 parent 1: bpf cls.o [...]
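The return value of the toy classifier above packs two 16-bit halves into
one 32-bit value. Assuming the usual tc convention of a class id made of a
16-bit major and a 16-bit minor handle, the arithmetic can be sketched as
follows; the helper names are hypothetical.

```c
#include <stdint.h>

/* Combine a major and a minor handle into a 32-bit class id. */
static uint32_t demo_classid(uint16_t major, uint16_t minor)
{
	return ((uint32_t)major << 16) | minor;
}

/* Upper 16 bits of a class id. */
static uint16_t demo_major(uint32_t classid)
{
	return (uint16_t)(classid >> 16);
}

/* Lower 16 bits of a class id. */
static uint16_t demo_minor(uint32_t classid)
{
	return (uint16_t)(classid & 0xffffu);
}
```

Under that reading, the toy example returns major 0x800 with the IP TOS
byte as the minor part, so each TOS value maps to its own class.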
As it has been demonstrated, the scope can even reach up to a fully
fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).
For tc, maps are allowed to be used, but from kernel context only,
in other words, eBPF code can keep state across filter invocations.
In future, we perhaps may reattach from a different application to
those maps e.g., to read out collected statistics/state.
Similarly as in socket filters, we may extend functionality for eBPF
classifiers over time depending on the use cases. For that purpose,
cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
we can allow additional functions/accessors (e.g. an ABI compatible
offset translation to skb fields/metadata). For an initial cls_bpf
support, we allow the same set of helper functions as eBPF socket
filters, but we could diverge at some point in time w/o problem.
I was wondering whether cls_bpf and act_bpf could share C programs,
I can imagine that at some point, we introduce i) further common
handlers for both (or even beyond their scope), and/or if truly needed
ii) some restricted function space for each of them. Both can be
abstracted easily through struct bpf_verifier_ops in future.
The context of cls_bpf versus act_bpf is slightly different though:
a cls_bpf program returns a specific classid whereas an act_bpf
program returns a drop/non-drop code; the latter may in future also
mangle skbs. That said, we can surely have a "classify" and an
"action" section in a single object file or, given the mentioned
constraint, add the possibility of a shared section.
The workflow for getting native eBPF running from tc [1] is as
follows: for f_bpf, I've added a slightly modified ELF parser code
from Alexei's kernel sample, which reads out the LLVM compiled
object, sets up maps (and dynamically fixes up map fds) if any, and
loads the eBPF instructions all centrally through the bpf syscall.
The resulting fd from the loaded program itself is being passed down
to cls_bpf, which looks up struct bpf_prog from the fd store, and
holds reference, so that it stays available also after tc program
lifetime. On tc filter destruction, it will then drop its reference.
Moreover, I've also added the optional possibility to annotate an
eBPF filter with a name (e.g. path to object file, or something
else if preferred) so that when tc dumps currently installed filters,
some more context can be given to an admin for a given instance (as
opposed to just the file descriptor number).
Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
exported, so that eBPF can be used from cls_bpf built as a module.
Thanks to 60a3b2253c41 ("net: bpf: make eBPF interpreter images
read-only"), I think this is of no concern since anything wanting to
alter eBPF opcodes after the verification stage would crash the kernel.
[1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:47 +0000 (12:31 +0100)]
ebpf: move read-only fields to bpf_prog and shrink bpf_prog_aux
is_gpl_compatible and prog_type should be moved directly into bpf_prog
as they stay immutable during bpf_prog's lifetime, are core attributes
and they can be locked as read-only later on via bpf_prog_select_runtime().
With a bit of rearranging, this also allows us to shrink bpf_prog_aux
to exactly 1 cacheline.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:46 +0000 (12:31 +0100)]
ebpf: add sched_cls_type and map it to sk_filter's verifier ops
As discussed recently and at netconf/netdev01, we want to prevent making
bpf_verifier_ops registration available for modules, but have them at a
controlled place inside the kernel instead.
The reason for this is that out-of-tree modules can go crazy and define
and register any verifier ops they want, doing all sorts of crap, even
bypassing available GPLed eBPF helper functions. We don't want to offer
such a shiny playground, of course, but keep strict control to ourselves
inside the core kernel.
This also encourages us to design eBPF user helpers carefully and
generically, so they can be shared among various subsystems using eBPF.
For the eBPF traffic classifier (cls_bpf), it's a good start to share
the same helper facilities as we currently do in eBPF for socket filters.
That way, we have BPF_PROG_TYPE_SCHED_CLS look like its own type, so
that if one day there is a good reason to diverge the set of helper
functions from the set available to socket filters, we keep ABI
compatibility.
In future, we could place all bpf_prog_type_list at a central place,
perhaps.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:45 +0000 (12:31 +0100)]
ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code
This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code,
now that the BPF internal header can deal with it.
While going over it, I also changed eBPF related functions to a sk_filter
prefix to be more consistent with the rest of the file.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:44 +0000 (12:31 +0100)]
ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs
Socket filter code and other subsystems with upcoming eBPF support should
not need to deal with the fact that we have CONFIG_BPF_SYSCALL defined or
not.
Having the bpf syscall as a config option is a nice thing and I'd expect
it to stay that way for expert users (I presume one day the default setting
of it might change, though), but code making use of it should not care if
it's actually enabled or not.
Instead, hide this via header files and let the rest deal with it.
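A user-space sketch of hiding a config option behind a header: when the
option is off, an inline stub with the same signature reports "not
supported", and callers contain no ifdef at all. DEMO_CONFIG_SYSCALL,
demo_prog_get and demo_attach are hypothetical, and errno stands in for
the kernel's error-pointer convention.

```c
#include <errno.h>
#include <stddef.h>

struct demo_prog {
	int id;
};

/* Define DEMO_CONFIG_SYSCALL to model a build with the feature enabled. */
#ifdef DEMO_CONFIG_SYSCALL
struct demo_prog *demo_prog_get(int fd);   /* implemented elsewhere */
#else
/* Stub: same signature, always reports "not supported". */
static inline struct demo_prog *demo_prog_get(int fd)
{
	(void)fd;
	errno = EOPNOTSUPP;
	return NULL;
}
#endif

/* Caller code is identical in both configurations. */
static int demo_attach(int fd)
{
	struct demo_prog *prog = demo_prog_get(fd);

	return prog ? 0 : -errno;
}
```

The conditional lives in exactly one place, so adding another caller never
requires touching the configuration logic.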
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:43 +0000 (12:31 +0100)]
ebpf: export BPF_PSEUDO_MAP_FD to uapi
We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
ELF BPF loader where instructions are being loaded that need map fixups.
An initial stage loads all maps into the kernel, and later on replaces
related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
register and the actual fd as immediate value.
The kernel verifier recognizes this keyword and replaces the map fd with
a real pointer internally.
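The loader-side fixup described above can be sketched as follows. The
instruction layout and the two constants are local copies so that the
sketch builds without kernel headers; demo_insn and demo_apply_map_reloc
are hypothetical names, and a real loader would find the instruction index
from the ELF relocation entries.

```c
#include <stdint.h>

/* Local copies of the relevant values, to avoid kernel headers. */
#define DEMO_BPF_LD_IMM64      0x18   /* BPF_LD | BPF_IMM | BPF_DW */
#define DEMO_BPF_PSEUDO_MAP_FD 1

struct demo_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

/*
 * Apply one relocation: instruction idx must be a 64-bit immediate load
 * (which spans two slots). Mark it as a map reference and store the fd.
 */
static int demo_apply_map_reloc(struct demo_insn *insns, int cnt,
				int idx, int map_fd)
{
	if (idx < 0 || idx + 1 >= cnt)
		return -1;
	if (insns[idx].code != DEMO_BPF_LD_IMM64)
		return -1;
	insns[idx].src_reg = DEMO_BPF_PSEUDO_MAP_FD;
	insns[idx].imm = map_fd;
	return 0;
}
```

After this step the program carries file descriptors rather than
pointers; turning them into real map pointers is left to the verifier, as
the message explains.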
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:42 +0000 (12:31 +0100)]
ebpf: constify various function pointer structs
We can move bpf_map_ops and bpf_verifier_ops and other structs into ro
section, bpf_map_type_list and bpf_prog_type_list into read mostly.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sun, 1 Mar 2015 11:31:41 +0000 (12:31 +0100)]
ebpf: remove kernel test stubs
Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can
remove the test stubs which were added to get the verifier suite up.
We can just let the test cases probe under socket filter type instead.
In the fill/spill test case, we cannot (yet) access fields from the
context (skb), but we may adapt that test case in future.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 1 Mar 2015 04:39:05 +0000 (23:39 -0500)]
Merge branch 's390-next'
Ursula Braun says:
====================
s390: network patches for net-next
here are some s390 related patches for net-next
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Fri, 27 Feb 2015 11:52:34 +0000 (12:52 +0100)]
MAINTAINERS: update S390 NETWORK DRIVERS maintainer
remove Frank Blaschka as S390 NETWORK DRIVERS maintainer
Acked-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Fri, 27 Feb 2015 11:52:33 +0000 (12:52 +0100)]
qeth: Fix command sizes
This patch adjusts two instances where we were using the (too big)
struct qeth_ipacmd_setadpparms size instead of the commands' actual
size. This didn't do any harm, but wasted a few bytes.
Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Fri, 27 Feb 2015 11:52:32 +0000 (12:52 +0100)]
s390: remove claw driver
claw devices are outdated and no longer supported.
This patch removes the claw driver.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 27 Feb 2015 03:08:59 +0000 (19:08 -0800)]
tcp: cleanup static functions
tcp_fastopen_create_child() is static and should not be exported.
tcp4_gso_segment() and tcp6_gso_segment() should be static.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Schwartzmeyer [Fri, 27 Feb 2015 00:27:14 +0000 (16:27 -0800)]
hyperv: Implement netvsc_get_channels() ethtool op
This adds support for reporting the actual and maximum combined channels
count of the hv_netvsc driver via 'ethtool --show-channels'.
This required adding 'max_chn' to 'struct netvsc_device', and assigning
it 'rsscap.num_recv_que' in 'rndis_filter_device_add'. Now we can access
the combined maximum channel count via 'struct netvsc_device' in the
ethtool callback.
Signed-off-by: Andrew Schwartzmeyer <andrew@schwartzmeyer.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 28 Feb 2015 20:10:47 +0000 (15:10 -0500)]
Merge branch 'tcp-tso'
Eric Dumazet says:
====================
tcp: tso improvements
This patch series reworks tcp_tso_should_defer() a bit
to get fewer bursts and better ECN behavior.
We also removed tso_deferred field in tcp socket.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 26 Feb 2015 22:10:20 +0000 (14:10 -0800)]
tcp: tso: allow CA_CWR state in tcp_tso_should_defer()
Another TCP issue is triggered by ECN.
Under pressure, receiver gets ECN marks, and sends back ACK packets
with ECE TCP flag. Senders enter CA_CWR state.
In this state, tcp_tso_should_defer() is short cut :
        if (icsk->icsk_ca_state != TCP_CA_Open)
                goto send_now;
This means that about all ACK packets we receive are triggering
a partial send, and because cwnd is kept small, we can only send
a small amount of data for each incoming ACK,
which in turn generates more ACK packets.
Allowing CA_Open and CA_CWR states to enable TSO defer in
tcp_tso_should_defer() brings performance back :
TSO autodefer has more chance to defer under pressure.
This patch increases TSO and LRO/GRO efficiency back to normal levels,
and does not impact overall ECN behavior.
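The relaxed check can be sketched as below. This is a stand-alone
illustration of "defer is allowed in CA_Open and CA_CWR", not the condition
used in the actual patch; the enum and function names are hypothetical.

```c
/* Hypothetical mirror of the congestion-avoidance states, in kernel order. */
enum demo_ca_state {
	DEMO_CA_OPEN,
	DEMO_CA_DISORDER,
	DEMO_CA_CWR,
	DEMO_CA_RECOVERY,
	DEMO_CA_LOSS,
};

/* Returns 1 if TSO deferral may be considered in this state, else 0. */
static int demo_tso_defer_allowed(enum demo_ca_state state)
{
	return state == DEMO_CA_OPEN || state == DEMO_CA_CWR;
}
```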
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 26 Feb 2015 22:10:19 +0000 (14:10 -0800)]
tcp: tso: restore IW10 after TSO autosizing
With sysctl_tcp_min_tso_segs being 4, it is very possible
that tcp_tso_should_defer() decides not to send the last 2 MSS
of the initial window of 10 packets. This also applies if
autosizing decides to send X MSS per GSO packet, and cwnd
is not a multiple of X.
This patch implements a heuristic based on the age of the first
skb in the write queue: if it was sent very recently (less than half srtt),
we can predict that no ACK packet will come in less than half rtt,
so deferring might cause underutilization of our window.
This is visible on initial send (IW10) on web servers,
but more generally on some RPC, as the last part of the message
might need an extra RTT to get delivered.
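The age test can be sketched as follows, assuming plain microsecond
timestamps and an unscaled srtt. The kernel keeps srtt in a scaled form, so
the real comparison looks different; demo_head_sent_recently is a
hypothetical name.

```c
#include <stdint.h>

/*
 * Returns 1 when the first skb in the write queue was sent less than half
 * an srtt ago, i.e. no ACK can be expected soon and deferring would only
 * leave the window idle.
 */
static int demo_head_sent_recently(uint32_t now_us, uint32_t head_sent_us,
				   uint32_t srtt_us)
{
	uint32_t age_us = now_us - head_sent_us;	/* wrap-safe */

	return age_us < srtt_us / 2;
}
```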
Tested:
Ran following packetdrill test
// A simple server-side test that sends exactly an initial window (IW10)
// worth of packets.
`sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+.1 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.1 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
+0 write(4, ..., 14600) = 14600
+0 > . 1:5841(5840) ack 1 win 457
+0 > . 5841:11681(5840) ack 1 win 457
// Following packet should be sent right now.
+0 > P. 11681:14601(2920) ack 1 win 457
+.1 < . 1:1(0) ack 14601 win 257
+0 close(4) = 0
+0 > F. 14601:14601(0) ack 1
+.1 < F. 1:1(0) ack 14602 win 257
+0 > . 14602:14602(0) ack 2
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 26 Feb 2015 22:10:18 +0000 (14:10 -0800)]
tcp: tso: remove tp->tso_deferred
TSO relies on ability to defer sending a small amount of packets.
Heuristic is to wait for future ACKS in hope to send more packets at once.
Current algorithm uses a per socket tso_deferred field as a pseudo timer.
This pseudo timer relies on future ACK, but there is no guarantee
we receive them in time.
Fix would be to use a real timer, but cost of such timer is probably too
expensive for typical cases.
This patch changes the logic to test the time of last transmit,
because we should not add bursts of more than 1ms for any given flow.
We've used this patch for about two years at Google, before FQ/pacing
as it would reduce a fair amount of bursts.
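A rough sketch of the "time of last transmit" test, assuming microsecond
timestamps and a 1ms bound expressed as DEMO_MAX_DEFER_US. The names are
hypothetical and the kernel works in its own timestamp units.

```c
#include <stdint.h>

#define DEMO_MAX_DEFER_US 1000u	/* do not hold a flow back for more than 1ms */

/* Returns 1 when the last transmit is old enough that we must send now. */
static int demo_defer_expired(uint32_t now_us, uint32_t last_tx_us)
{
	return (uint32_t)(now_us - last_tx_us) > DEMO_MAX_DEFER_US;
}
```

No per-socket pseudo timer is needed: the decision depends only on the
current time and the timestamp of the last transmit.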
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Thu, 26 Feb 2015 19:34:37 +0000 (19:34 +0000)]
usbnet: Fix tx_packets stat for FLAG_MULTI_PACKET drivers
Currently the usbnet core does not update the tx_packets statistic for
drivers with FLAG_MULTI_PACKET and there is no hook in the TX
completion path where they could do this.
cdc_ncm and dependent drivers are bumping tx_packets stat on the
transmit path while asix and sr9800 aren't updating it at all.
Add a packet count in struct skb_data so these drivers can fill it
in, initialise it to 1 for other drivers, and add the packet count
to the tx_packets statistic on completion.
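The accounting scheme can be sketched as below, with hypothetical demo_*
types standing in for struct skb_data and the netdev stats. It only
illustrates "default to 1, let multi-packet drivers overwrite, add on
completion".

```c
#include <stdint.h>

struct demo_skb_data {
	unsigned int packets;	/* frames carried by this URB */
};

struct demo_stats {
	uint64_t tx_packets;
};

/* Core sets the default; a multi-packet driver may overwrite it later. */
static void demo_tx_prepare(struct demo_skb_data *entry)
{
	entry->packets = 1;
}

/* TX completion adds whatever count ended up in the entry. */
static void demo_tx_complete(struct demo_stats *stats,
			     const struct demo_skb_data *entry)
{
	stats->tx_packets += entry->packets;
}
```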
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Tested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Feb 2015 23:18:52 +0000 (18:18 -0500)]
Merge branch 'tipc-next'
Erik Hugne says:
====================
tipc: bug fix and some improvements
Most important is a fix for a nullptr exception that would occur when
name table subscriptions fail. The remaining patches are performance
improvements and cosmetic changes.
v2: remove unnecessary whitespace in patch #2
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 27 Feb 2015 07:56:58 +0000 (08:56 +0100)]
tipc: make media address offset a common define
With the exception of infiniband media which does not use media
offsets, the media address is always located at offset 4 in the
media info field as defined by the protocol, so we move the
definition to the generic bearer.h
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 27 Feb 2015 07:56:57 +0000 (08:56 +0100)]
tipc: rename media/msg related definitions
The TIPC_MEDIA_ADDR_SIZE and TIPC_MEDIA_ADDR_OFFSET names
are misleading, as they actually define the size and offset of
the whole media info field and not the address part. This patch
does not have any functional changes.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 27 Feb 2015 07:56:56 +0000 (08:56 +0100)]
tipc: purge links when bearer is disabled
If a bearer is disabled by manual intervention, all links over that
bearer should be purged, indicated with the 'shutting_down' flag.
Otherwise tipc will get confused if a new bearer is enabled using
a different media type.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 27 Feb 2015 07:56:55 +0000 (08:56 +0100)]
tipc: fix nullpointer bug when subscribing to events
If a subscription request is sent to a topology server
connection, and any error occurs (malformed request, oom
or limit reached) while processing this request, TIPC should
terminate the subscriber connection. While doing so, it tries
to access fields in an already freed (or never allocated)
subscription element leading to a nullpointer exception.
We fix this by removing the subscr_terminate function and
terminate the connection immediately upon any subscription
failure.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 27 Feb 2015 07:56:54 +0000 (08:56 +0100)]
tipc: only create header copy for name distr messages
The TIPC name distributor pushes topology updates to the cluster
neighbors. Currently this is done in a unicast manner, and the
skb holding the update is cloned for each cluster member. This
is unnecessary, as we only modify the destnode field in the header
so we change it to do pskb_copy instead.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Wed, 25 Feb 2015 18:52:11 +0000 (19:52 +0100)]
team: allow TSO being set on master
This patch allows TSO being set/unset on the master, so that GSO
segmentation is done after team layer.
Similar patch is present for bonding:
b0ce3508b25e ("bonding: allow TSO being set on bonding master")
and bridge:
f902e8812ef6 ("bridge: Add ability to enable TSO")
Suggested-by: Jiri Prochazka <jprochaz@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Feb 2015 21:37:23 +0000 (16:37 -0500)]
Merge branch 'fib_trie_remove_leaf_info'
Alexander Duyck says:
====================
fib_trie: Remove leaf_info structure
This patch set removes the leaf_info structure from the IPv4 fib_trie. The
general idea is that the leaf_info structure itself only held about 6
actual bits of data, beyond that it was mostly just waste. As such we can
drop the structure, move the 1 byte representing the prefix/suffix length
into the fib_alias and just link it all into one list.
My testing shows that this saves somewhere between 4 and 10ns depending on
the type of test performed. I'm suspecting that this represents 1 to 2 L1
cache misses saved per look-up.
One side effect of this change is that semantic_match_miss will now only
increment once per leaf instead of once per leaf_info miss. However the
stat is already skewed now that we perform a preliminary check on the leaf
as a part of the look-up.
I also have gone through and addressed a number of ordering issues in the
first patch since I had misread the behavior of list_add_tail.
I have since run some additional testing and verified the resulting lists
are in the same order when combining multiple prefix length and tos values
in a single leaf.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 25 Feb 2015 23:31:51 +0000 (15:31 -0800)]
fib_trie: Remove leaf_info
At this point the leaf_info hash is redundant. By adding the suffix length
to the fib_alias hash list we no longer have need of leaf_info as we can
determine the prefix length from fa_slen. So we can compress things by
dropping the leaf_info structure from fib_trie and instead directly connect
the leaves to the fib_alias hash list.
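The relation between the stored suffix length and the prefix length can be
sketched as follows for a 32-bit IPv4 key. DEMO_KEYLENGTH and the function
names are hypothetical; only the arithmetic is the point.

```c
#include <stdint.h>

#define DEMO_KEYLENGTH 32	/* bits in an IPv4 key */

/* Prefix length recovered from the suffix length stored in the alias. */
static unsigned int demo_plen_from_slen(unsigned int slen)
{
	return DEMO_KEYLENGTH - slen;
}

/* Network mask for a given suffix length; slen == 32 is the default route. */
static uint32_t demo_mask_from_slen(unsigned int slen)
{
	return slen >= DEMO_KEYLENGTH ? 0 : ~(uint32_t)0 << slen;
}
```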
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 25 Feb 2015 23:31:44 +0000 (15:31 -0800)]
fib_trie: Add slen to fib alias
Make use of an empty spot in the alias to store the suffix length so that
we don't need to pull that information from the leaf_info structure.
This patch also makes a slight change to the user statistics. Instead of
incrementing semantic_match_miss once per leaf_info miss we now just
increment it once per leaf if a match was not found.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 25 Feb 2015 23:31:37 +0000 (15:31 -0800)]
fib_trie: Replace plen with slen in leaf_info
This replaces the prefix length variable in the leaf_info structure with a
suffix length value, or host identifier length in bits. By doing this it
makes it easier to sort out since the tnodes and leaf are carrying this
value as well since it is compatible with the ->pos field in tnodes.
I also cleaned up one spot that had some list manipulation that could be
simplified. I basically updated it so that we just use hlist_add_head_rcu
instead of calling hlist_add_before_rcu on the first node in the list.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 25 Feb 2015 23:31:31 +0000 (15:31 -0800)]
fib_trie: Convert fib_alias to hlist from list
There isn't any advantage to having it as a list and by making it an hlist
we make the fib_alias more compatible with the leaf_info in terms of the
type of list used.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Feb 2015 21:25:30 +0000 (16:25 -0500)]
Merge branch 'ip_level_multicast_join_leave'
Madhu Challa says:
====================
Multicast group join/leave at ip level
This series enables configuring multicast group join/leave at ip level
by extending the "ip address" command.
It adds a new control socket mc_autojoin_sock and ifa_flag IFA_F_MCAUTOJOIN
to invoke the corresponding igmp group join/leave api.
Since the igmp group join/leave api takes the rtnl_lock the code had to
be refactored by adding a shim layer prefixed by __ that can be invoked
by code that already has the rtnl_lock. This way we avoid proliferation of
work queues.
The first patch in this series does the refactoring for igmp v6.
It's based on igmp v4 changes that were added by Eric Dumazet.
The second patch in this series does the group join/leave based on the
setting of the IFA_F_MCAUTOJOIN flag.
v5:
- addressed comments from Daniel Borkmann.
- removed blank line in patch 1/2
- removed unused variable, const arg in patch 2/2
v4:
- addressed comments from Yoshifuji Hideaki.
- Remove WARN_ON not needed because we return a value from v2.
- addressed comments from Daniel Borkmann.
- rename sock to mc_autojoin_sk
- ip_mc_config() pass ifa so it needs one less argument.
- igmp_net_{init|destroy}() use inet_ctl_sock_{create|destroy}
- inet_rtm_newaddr() change scope of ret.
- igmp_net_init() no need to initialize sock to NULL.
v3:
- addressed comments from David Miller.
- fixed indentation and local variable order.
v2:
- addressed comments from Eric Dumazet.
- removed workqueue and call __ip_mc_{join|leave}_group or
__ipv6_sock_mc_{join|drop}
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Madhu Challa [Wed, 25 Feb 2015 17:58:35 +0000 (09:58 -0800)]
multicast: Extend ip address command to enable multicast group join/leave on
Joining multicast group on ethernet level via "ip maddr" command would
not work if we have an Ethernet switch that does igmp snooping since
the switch would not replicate multicast packets on ports that did not
have IGMP reports for the multicast addresses.
Linux vxlan interfaces created via "ip link add vxlan" have the group option
that enables them to do the required join.
By extending ip address command with option "autojoin" we can get similar
functionality for openvswitch vxlan interfaces as well as other tunneling
mechanisms that need to receive multicast traffic. The kernel code is
structured similar to how the vxlan driver does a group join / leave.
example:
ip address add 224.1.1.10/24 dev eth5 autojoin
ip address del 224.1.1.10/24 dev eth5
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Madhu Challa [Wed, 25 Feb 2015 17:58:34 +0000 (09:58 -0800)]
igmp v6: add __ipv6_sock_mc_join and __ipv6_sock_mc_drop
Based on the igmp v4 changes from Eric Dumazet.
959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()")
These changes are needed to perform igmp v6 join/leave while
RTNL is held.
Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
__ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid
proliferation of work queues.
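The wrapper pattern can be sketched in user space as below. A plain counter
stands in for RTNL and all demo names are hypothetical; the double-underscore
prefix follows the kernel convention used in this series.

```c
static int demo_rtnl_depth;	/* stand-in for RTNL: 1 while "held" */
static int demo_joined_group;

static void demo_rtnl_lock(void)   { demo_rtnl_depth++; }
static void demo_rtnl_unlock(void) { demo_rtnl_depth--; }

/* Does the work. The caller must already hold the lock. */
static int __demo_mc_join(int group)
{
	if (demo_rtnl_depth != 1)
		return -1;	/* called without the lock */
	demo_joined_group = group;
	return 0;
}

/* Public entry point: takes the lock, then defers to the __ variant. */
static int demo_mc_join(int group)
{
	int ret;

	demo_rtnl_lock();
	ret = __demo_mc_join(group);
	demo_rtnl_unlock();
	return ret;
}
```

A path that already holds the lock, such as an address add handler, calls
the __ variant directly instead of queueing work.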
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 24 Feb 2015 17:17:31 +0000 (09:17 -0800)]
udp: In udp_flow_src_port use random hash value if skb_get_hash fails
In the unlikely event that skb_get_hash is unable to deduce a hash
in udp_flow_src_port we use a consistent random value instead.
This is specified in GRE/UDP draft section 3.2.1:
https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04
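A sketch of the fallback, assuming a hash of 0 means "no hash available" and
that the caller supplies a value that was randomized once and then kept. The
function name and the port-scaling step are illustrative, not a copy of
udp_flow_src_port.

```c
#include <stdint.h>

/*
 * Map a flow hash into [min, max). If no hash could be computed, use the
 * caller's fixed random fallback so the chosen port stays consistent.
 */
static uint16_t demo_flow_src_port(uint32_t hash, uint32_t fallback,
				   uint16_t min, uint16_t max)
{
	if (hash == 0)
		hash = fallback;

	hash ^= hash << 16;	/* mix low bits into the high half */

	return (uint16_t)((((uint64_t)hash * (uint32_t)(max - min)) >> 32) + min);
}
```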
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Fri, 27 Feb 2015 08:58:30 +0000 (09:58 +0100)]
at86rf230: add warning if edge-triggered irq
While testing I experienced a deadlock while using the at86rf233 on a
raspberry pi. The reason was an edge triggered gpio irq because the irq
triggered while irq was disabled. This issue doesn't happen on a level
triggered irq because the irq will hit after calling enable_irq.
This patch adds a warning that it's not recommended to use an edge-triggered
irq type. Also change the examples to high-level irqtype.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Alexander Aring [Fri, 27 Feb 2015 08:58:29 +0000 (09:58 +0100)]
at86rf230: add irq low-level for polarity
The at86rf2xx chips support the setting of irq polarity if active low
or active high. This patch adds a handling for IRQ_ACTIVE_LOW if the
irq_type is IRQ_TYPE_LEVEL_LOW.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Alexander Aring [Fri, 27 Feb 2015 08:58:28 +0000 (09:58 +0100)]
at86rf230: add irqmask mode setting
Since we support at86rf233 we need to ensure that basic operation
default values are the same. This patch always sets IRQ_MASK_MODE to 0,
which after reset is 1 on the at86rf233 and 0 on the at86rf231.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Alexander Aring [Fri, 27 Feb 2015 08:58:27 +0000 (09:58 +0100)]
at86rf230: remove tx_timeout
This patch removes tx_timeout handling. We used it in sync xmit
handling. Since we support async xmit handling a xmit timeout handling
isn't easy to implement and should be implemented by netdev watchdog
mechanism.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Alexander Aring [Fri, 27 Feb 2015 08:58:26 +0000 (09:58 +0100)]
at86rf230: add support for external xtal trim
This patch adds support for setting the xtal trim register. Some at86rf2xx
transceiver boards need fine tuning of the xtal capacitor.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Alexander Aring [Fri, 27 Feb 2015 08:58:25 +0000 (09:58 +0100)]
at86rf230: copy pdata to driver allocated space
This patch copies the platform data in driver allocated space at first.
With this change we ensure that we access the allocated platform data as
readonly space.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Jiri Slaby [Thu, 19 Feb 2015 14:20:43 +0000 (15:20 +0100)]
Bluetooth: make hci_test_bit's addr const
gcc5 warns about passing a const array to hci_test_bit which takes a
non-const pointer:
net/bluetooth/hci_sock.c: In function ‘hci_sock_sendmsg’:
net/bluetooth/hci_sock.c:955:8: warning: passing argument 2 of ‘hci_test_bit’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-array-qualifiers]
&hci_sec_filter.ocf_mask[ogf])) &&
^
net/bluetooth/hci_sock.c:49:19: note: expected ‘void *’ but argument is of type ‘const __u32 (*)[4] {aka const unsigned int (*)[4]}’
static inline int hci_test_bit(int nr, void *addr)
^
So make 'addr' 'const void *'.
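A stand-alone sketch of a bit test that takes a const pointer, so that a
const mask table can be passed without discarding the qualifier.
demo_test_bit is a hypothetical name and uint32_t replaces the kernel's
__u32.

```c
#include <stdint.h>

/* Returns 1 if bit nr is set in the array of 32-bit words at addr. */
static inline int demo_test_bit(int nr, const void *addr)
{
	const uint32_t *words = addr;

	return (words[nr >> 5] & ((uint32_t)1 << (nr & 31))) != 0;
}
```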
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Johan Hedberg [Fri, 27 Feb 2015 08:11:13 +0000 (10:11 +0200)]
Bluetooth: Update New CSRK event to match latest specification
The 'master' parameter of the New CSRK event was recently renamed to
'type', with the old values kept for backwards compatibility as
unauthenticated local/remote keys. This patch updates the code to take
into account the two new (authenticated) values and ensures they get
used based on the security level of the connection that the respective
keys get distributed over.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Roopa Prabhu [Thu, 26 Feb 2015 07:55:40 +0000 (23:55 -0800)]
bridge: fix link notification skb size calculation to include vlan ranges
My previous patch skipped vlan range optimizations during skb size
calculations for simplicity.
This incremental patch considers vlan ranges during
skb size calculations. This leads to a bit of code duplication
in the fill and size calculation functions. But, I could not find a
prettier way to do this. will take any suggestions.
Previously, I had reused the existing br_get_link_af_size size calculation
function to calculate skb size for notifications. Reusing it this time
around creates some change in behaviour issues for the usual
.get_link_af_size callback.
This patch adds a new br_get_link_af_size_filtered() function to
base the size calculation on the incoming filter flag and include
vlan ranges.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 26 Feb 2015 16:22:03 +0000 (11:22 -0500)]
Merge branch 'rocker-next'
Scott Feldman says:
====================
rocker cleanups
Pushing out some rocker cleanups I've had in my queue for a while. Nothing
major, just some sync-up with changes that already went into device code
(hard-coding desc err return values and lport renaming). Also fix up
port forwarding transitions prompted by some DSA discussions about how to
restore port state when port leaves bridge.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Thu, 26 Feb 2015 04:15:38 +0000 (20:15 -0800)]
rocker: put port in FORWARDING state after leaving bridge
Cleanup the port forwarding state transitions for the cases when the port
joins or leaves a bridge, or is brought admin UP or DOWN. When port is
bridged, we can rely on bridge driver putting port in correct state using
STP callback into port driver, regardless if bridge is enabled for STP or not.
When port is not bridged, we can reuse some of the STP code to enable or
disable forwarding depending on UP or DOWN.
Tested by trying all the transitions from bridge/not bridge, and UP/DOWN, and
verifying port is in the correct forwarding state after each transition.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Thu, 26 Feb 2015 04:15:37 +0000 (20:15 -0800)]
rocker: rename lport to pport
This is just a rename of physical ports from "lport" to "pport". Not a
functional change. OF-DPA uses logical ports (lport) for tunnels, but the
driver (and device) were using "lport" for physical ports. Renaming physical
ports references to "pport", freeing up "lport" for use later with tunnels.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Thu, 26 Feb 2015 04:15:36 +0000 (20:15 -0800)]
rocker: fix non-portable err return codes
The rocker device returns error codes if something goes wrong with descriptor
processing. Originally the device used standard errno codes for different
errors, but since those errno codes aren't portable across ARCHs, the device
now returns hard-coded error codes that stay constant across diff ARCHs. Fix
driver to use those same hard-coded values.
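The translation step can be sketched as below. The DEMO_DEV_* constants are
hypothetical placeholders for the device's fixed codes, not the values from
the rocker headers; the point is that the driver maps them explicitly rather
than assuming they equal the host's errno numbers.

```c
#include <errno.h>

/* Hypothetical fixed codes reported by the device, same on every arch. */
enum {
	DEMO_DEV_OK	= 0,
	DEMO_DEV_ENOENT	= 2,
	DEMO_DEV_ENOMEM	= 12,
	DEMO_DEV_EINVAL	= 22,
};

/* Translate a device code into the host's negative errno convention. */
static int demo_desc_err_to_errno(int dev_err)
{
	switch (dev_err) {
	case DEMO_DEV_OK:
		return 0;
	case DEMO_DEV_ENOENT:
		return -ENOENT;
	case DEMO_DEV_ENOMEM:
		return -ENOMEM;
	case DEMO_DEV_EINVAL:
	default:
		return -EINVAL;	/* unknown codes are treated as invalid */
	}
}
```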
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 25 Feb 2015 23:13:07 +0000 (18:13 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-02-24
This series contains updates to i40e and i40evf only, which bumps their
versions to i40e 1.2.9 and i40evf 1.2.3.
Paul fixes i40e_debug_aq() for big endian machines by adding the
appropriate LExx_TO_CPU wrappers.
Catherine adds a requested speed variable to the link_status to store the
last speeds we requested from the firmware and use the advertised speed
settings in get_settings in ethtool now that we have it. Due to the
new code addition, she also refactors get_settings to improve readability
and to accommodate some of the longer lines of code by adding two
functions i40e_get_settings_link_up() and i40e_get_settings_link_down().
Carolyn adds a struct to the VSI struct to keep track of RXNFC settings
done via ethtool. Adds more information to the interrupt vector
names, specifically to the VF misc vector name so that we can distinguish
between all the interrupts.
Ashish enables the i40evf driver to enable debug prints via ethtool.
Mitch updates i40e to enable packet split only when IOMMU is in use,
since it shows a distinct advantage over the single-buffer path
because it minimizes DMA mapping and unmapping. Also adds the receive
routine in use to the features log message to be able to print the
receive packet split status.
Greg adds the ability to get, set and commit permanently the NPAR
partition BW configuration through configfs. Enables an application
to query the i40e driver's private flags to get the status of NPAR
enablement via ethtool.
Neerav adds support for bridge offload ndo_ops getlink and setlink
to enable bridge hardware mode as per the mode set via IFLA_BRIDGE_MODE.
The support is only enabled in the case of a PF VSI and not available for
any other VSI type.
Kevin fixes i40e by ensuring the BUF and FLAG_RD flags are set for
indirect admin queue command.
Vasu updates the driver to setup FCoE netdev device type as "fcoe", so that
it shows up in sysfs as FCoE device.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Guenter Roeck [Wed, 25 Feb 2015 07:02:02 +0000 (23:02 -0800)]
net: dsa: Introduce dsa_is_port_initialized
To avoid race conditions when using the ds->ports[] array,
we need to check if the accessed port has been initialized.
Introduce and use helper function dsa_is_port_initialized
for that purpose and use it where needed.
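Such a helper might look roughly like this. struct demo_switch and
DEMO_MAX_PORTS are hypothetical stand-ins for the DSA switch structure, and
the bounds check is an addition for the stand-alone example.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PORTS 12

struct demo_switch {
	uint32_t phys_port_mask;		/* ports that exist on the chip */
	void *ports[DEMO_MAX_PORTS];		/* set once the slave is created */
};

/* A port is usable only if it exists and its array entry has been set. */
static int demo_is_port_initialized(const struct demo_switch *ds, int p)
{
	if (p < 0 || p >= DEMO_MAX_PORTS)
		return 0;
	return (ds->phys_port_mask & (1u << p)) && ds->ports[p] != NULL;
}
```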
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 25 Feb 2015 22:04:15 +0000 (17:04 -0500)]
Merge branch 'sf2_hwbridge'
Florian Fainelli says:
====================
net: dsa: integration with SWITCHDEV for HW bridging
This patch set provides the DSA and SWITCHDEV integration bits together and
modifies the bcm_sf2 driver accordingly such that it works properly with HW
bridging.
Changes in v3:
- add back the null pointer check in dsa_slave_br_port_mask from Guenter
- slightly rework patch 1 commit message not to mention the function name
we add in patch 2
Changes in v2:
- avoid a race condition in how DSA network devices are created, patch from
Guenter Roeck
- provide a consistent and working STP state once a port leaves the bridge
- retain a bridge device pointer to properly flag port/bridge membership
- properly flush the ARL (Address Resolution Logic) in bcm_sf2.c
- properly retain port membership when individually bringing devices up/down
while they are members of a bridge
We discussed on the mailing-list the possibility of standardizing a "fdb_flush"
operation for DSA switch drivers, looking at the Marvell and Broadcom switches,
I am not convinced this is practical or desirable as the terminologies vary
here, but there is nothing preventing us from doing it later.
Many thanks to Guenter and Andrew for both testing and providing feedback.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 24 Feb 2015 21:15:34 +0000 (13:15 -0800)]
net: dsa: bcm_sf2: add HW bridging support
Implement the bridge join, leave and set_stp callbacks by making sure that
we do the following:
- when a port joins the bridge, all existing ports in the bridge get
their VLAN control register updated with that joining port
- the joining port includes all existing bridge ports in its own
VLAN control register
The leave operation is fairly similar; special care must be taken to
make sure that the port leaving the bridge does not remove itself from its
own VLAN control register.
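The join and leave bookkeeping above can be sketched as plain mask updates.
This models the per-port VLAN control registers as an array and assumes each
port starts with only its own bit set; struct demo_sw and the function names
are hypothetical, and no real register access is shown.

```c
#include <stdint.h>

#define DEMO_NUM_PORTS 8

struct demo_sw {
	uint32_t vlan_ctl[DEMO_NUM_PORTS];	/* per-port forwarding mask */
	uint32_t bridge_members;		/* ports currently bridged */
};

static void demo_port_join(struct demo_sw *sw, int port)
{
	int i;

	for (i = 0; i < DEMO_NUM_PORTS; i++) {
		if (!(sw->bridge_members & (1u << i)))
			continue;
		sw->vlan_ctl[i] |= 1u << port;	/* member learns the new port */
		sw->vlan_ctl[port] |= 1u << i;	/* new port learns the member */
	}
	sw->bridge_members |= 1u << port;
}

static void demo_port_leave(struct demo_sw *sw, int port)
{
	int i;

	/* Drop the port first so the loop never clears its own bit. */
	sw->bridge_members &= ~(1u << port);
	for (i = 0; i < DEMO_NUM_PORTS; i++) {
		if (!(sw->bridge_members & (1u << i)))
			continue;
		sw->vlan_ctl[i] &= ~(1u << port);
		sw->vlan_ctl[port] &= ~(1u << i);
	}
}
```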
Since the various BR_* states apply directly to our HW semantics, we
just need to translate these constants into their corresponding HW
settings, and voila!
We make sure to trigger a fast-ageing process for ports that are
joining/leaving the bridge and transition from incompatible states, this
is equivalent to triggering an ARL flush for that port.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 24 Feb 2015 21:15:33 +0000 (13:15 -0800)]
net: dsa: integrate with SWITCHDEV for HW bridging
In order to support bridging offloads in DSA switch drivers, select
NET_SWITCHDEV to get access to the port_stp_update and parent_get_id
NDOs that we are required to implement.
To facilitate the integration at the DSA driver level, we implement 3
types of operations:
- port_join_bridge
- port_leave_bridge
- port_stp_update
DSA will resolve which switch ports are currently bridge port
members as some Switch hardware/drivers need to know about that to limit
the register programming to just the relevant registers (especially for
slow MDIO buses).
We also take care of setting the correct STP state when slave network
devices are brought up/down while being bridge members.
Finally, when a port is leaving the bridge, we make sure we set it to the
BR_STATE_FORWARDING state, otherwise the bridge layer would leave it
disabled as a result of having left the bridge.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guenter Roeck [Tue, 24 Feb 2015 21:15:32 +0000 (13:15 -0800)]
net: dsa: Ensure that port array elements are initialized before being used
A network device notifier can be called for one or more of the created
slave devices before all slave devices have been registered. This can
result in a mismatch between ds->phys_port_mask and the registered devices
by the time the call is made, and it can result in a slave device being
added to a bridge before its entry in ds->ports[] has been initialized.
Rework the initialization code to initialize entries in ds->ports[] in
dsa_slave_create. With this change, dsa_slave_create no longer needs
to return slave_dev but can return an error code instead.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sravanthi Tangeda [Fri, 6 Feb 2015 08:52:21 +0000 (08:52 +0000)]
i40e/i40evf: Update driver versions
Bump i40e to 1.2.9 and i40evf to 1.2.3.
Also update the copyright year.
Change-ID: I345d777e94abd0acffe6a28793f675d251a86299
Signed-off-by: Sravanthi Tangeda <sravanthi.tangeda@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Carolyn Wyborny [Fri, 6 Feb 2015 08:52:20 +0000 (08:52 +0000)]
i40evf: Add more info to interrupt vector names
This patch adds the netdev name to the VF misc vector name. Without
this patch, all the interrupts show the same info, so it is difficult to
distinguish them.
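As an illustration of the idea, a vector name can be built with snprintf so that each device gets a distinct label. The helper name and the format string below are assumptions made for this sketch, not the exact strings used by the driver.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build a misc vector name that embeds the
 * netdev name. The "i40evf-%s:mbx" format is illustrative only.
 * Returns the snprintf result (length that would have been written).
 */
static int format_misc_vector_name(char *buf, size_t len,
				   const char *netdev_name)
{
	return snprintf(buf, len, "i40evf-%s:mbx", netdev_name);
}
```

With names built this way, two VFs appear under different labels instead of sharing one identical string.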
Change-ID: I247828697e1373ecfb5f8dc1bc9618e98a7f4942
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Greg Rose [Fri, 6 Feb 2015 08:52:19 +0000 (08:52 +0000)]
i40e: Use ethtool private flags to display NPAR status
Allow an application to query the i40e driver's private flags to get the
status of NPAR enablement. This will be used by applications to determine
if there are NPAR specific features available.
Change-ID: Ia6d9477a48f9c4cb41ca022bd433f77da3f2146c
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kevin Scott [Fri, 6 Feb 2015 08:52:18 +0000 (08:52 +0000)]
i40e: Set FLAG_RD when sending buffer FW must read
Set FLAG_RD for send_driver_version AQ command.
Change-ID: I8253051eff85a1d4b5a4e12ce0395b65ceb91e62
Signed-off-by: Kevin Scott <kevin.c.scott@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Fri, 6 Feb 2015 08:52:17 +0000 (08:52 +0000)]
i40e: print Rx packet split status
Add the RX routine in use to the features log message.
Change-ID: Ifbbf28fb7f42b9a3d2828586488e9e6331107dd5
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Vasu Dev [Mon, 9 Feb 2015 18:00:30 +0000 (18:00 +0000)]
i40e: setup FCoE device type
Set up the FCoE netdev device type as "fcoe", so that it shows up in
sysfs as an FCoE device.
Change-ID: Ie13a1a332dba4d5802586926104ee01ef20da44f
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kevin Scott [Fri, 6 Feb 2015 08:52:15 +0000 (08:52 +0000)]
i40e: Set BUF flag for Set Version AQ command
BUF flag must be set for indirect AQ command.
Change-ID: I6819718a47baf69d1a91ebaed89f735ed6e86025
Signed-off-by: Kevin Scott <kevin.c.scott@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Fri, 6 Feb 2015 08:52:14 +0000 (08:52 +0000)]
i40e: Add support for getlink, setlink ndo ops
Add support for bridge offload ndo_ops getlink and setlink to
enable bridge hardware mode as per the mode set via IFLA_BRIDGE_MODE.
The support is only enabled in case of a PF VSI and not available for
any other VSI type.
By default the i40e driver inserts a bridge as part of the bring-up
when an FDIR type VSI and/or an FCoE VSI is created. This bridge is
created in VEB mode by default, i.e. after creating the bridge using the
"Add VEB" AQ command, the loopback for the PF's default VSI is enabled.
The patch adds a capability where all the VSIs created as downlinks to
the bridge inherit the loopback property and enable loopback only
if the uplink bridge is operating in VEB mode.
Hence, there is no need to explicitly enable loopback as part of
allocating resources for SR-IOV VFs, and the call to do that has been
removed.
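The inheritance rule above can be sketched as follows. The BRIDGE_MODE_* values mirror the kernel's uapi linux/if_bridge.h; struct sketch_veb, struct sketch_vsi, and vsi_inherit_loopback() are hypothetical simplifications, not the driver's data structures.

```c
#include <stdbool.h>

/* Values mirror BRIDGE_MODE_* from the kernel's uapi linux/if_bridge.h. */
enum {
	BRIDGE_MODE_VEB = 0,
	BRIDGE_MODE_VEPA = 1,
};

/* Hypothetical, simplified view of the uplink bridge and a downlink VSI. */
struct sketch_veb {
	int bridge_mode;
};

struct sketch_vsi {
	const struct sketch_veb *uplink;
	bool loopback;
};

/* A downlink VSI enables loopback only if its uplink bridge is in VEB mode. */
static void vsi_inherit_loopback(struct sketch_vsi *vsi)
{
	vsi->loopback = vsi->uplink &&
			vsi->uplink->bridge_mode == BRIDGE_MODE_VEB;
}
```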
If a user request, made either via the "bridge" utility or using
the bridge netlink interface, requires changing the hardware
bridge mode, then that requires a PF reset and a rebuild of the
switch hierarchy.
Also update the copyright year.
Change-ID: I4d78fc1c83158efda29ba7be92239b74f75d6d25
Signed-off-by: Neerav Parikh <neerav.parikh@intel.com>
Tested-By: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>