Pablo Neira Ayuso [Sun, 27 Nov 2016 23:05:44 +0000 (00:05 +0100)]
netfilter: nf_tables: atomic dump and reset for stateful objects
This patch adds a new NFT_MSG_GETOBJ_RESET command to perform an atomic
dump-and-reset of the stateful object. This also adds support
for atomic dump and reset for counter and quota objects.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
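To illustrate what "atomic dump and reset" means for a counter, here is a minimal user-space sketch using C11 atomics. It is not the kernel implementation: struct demo_counter and demo_counter_dump_and_reset() are hypothetical names, and the real counter objects are percpu.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a stateful counter object. */
struct demo_counter {
    _Atomic uint64_t packets;
};

/*
 * Return the current value and reset it to zero in one atomic step,
 * so an update racing with the dump is neither lost nor reported twice.
 */
static uint64_t demo_counter_dump_and_reset(struct demo_counter *c)
{
    assert(c);
    return atomic_exchange(&c->packets, 0);
}
```

A separate read followed by a separate write of zero would lose any increment that lands between the two steps, which is the problem a combined operation avoids.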
Pablo Neira Ayuso [Sun, 27 Nov 2016 23:05:52 +0000 (00:05 +0100)]
netfilter: nft_quota: dump consumed quota
Add a new attribute NFTA_QUOTA_CONSUMED that displays the amount of
quota that has been already consumed. This allows us to restore the
internal state of the quota object between reboots as well as to monitor
how much of it has been used.
This patch changes the logic to account for the consumed bytes, instead
of the bytes that remain to be consumed.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
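The switch from tracking remaining bytes to tracking consumed bytes can be sketched as below. This is an illustration only: struct demo_quota and demo_quota_account() are hypothetical, and the exact comparison used by the kernel against the limit is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical quota object: the limit stays fixed, consumed grows. */
struct demo_quota {
    uint64_t limit;     /* configured quota in bytes */
    uint64_t consumed;  /* bytes accounted so far; this is what gets dumped */
};

/* Account len bytes and report whether the quota is now exceeded. */
static bool demo_quota_account(struct demo_quota *q, uint64_t len)
{
    assert(q);
    q->consumed += len;
    return q->consumed > q->limit;  /* assumption: strictly greater means over quota */
}
```

Because the consumed value is stored directly, restoring state after a reboot amounts to writing the saved value back into the consumed field.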
Pablo Neira Ayuso [Sun, 27 Nov 2016 23:05:38 +0000 (00:05 +0100)]
netfilter: nf_tables: add stateful object reference expression
This new expression allows us to refer to existing stateful objects from
rules.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 27 Nov 2016 23:04:43 +0000 (00:04 +0100)]
netfilter: nft_quota: add stateful object type
Register a new quota stateful object type into the new stateful object
infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 27 Nov 2016 23:04:36 +0000 (00:04 +0100)]
netfilter: nft_counter: add stateful object type
Register a new percpu counter stateful object type into the stateful
object infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 27 Nov 2016 23:04:32 +0000 (00:04 +0100)]
netfilter: nf_tables: add stateful objects
This patch augments nf_tables to support stateful objects. This new
infrastructure allows you to create, dump and delete stateful objects,
that are identified by a user-defined name.
This patch adds the generic infrastructure; follow-up patches add
support for two stateful objects: counters and quotas.
This patch provides a native infrastructure for nf_tables to replace
nfacct, the extended accounting infrastructure for iptables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 28 Nov 2016 10:40:06 +0000 (11:40 +0100)]
netfilter: add and use nf_fwd_netdev_egress
... so we can use current skb instead of working with a clone.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 28 Nov 2016 10:40:05 +0000 (11:40 +0100)]
netfilter: ingress: translate 0 nf_hook_slow retval to -1
The caller assumes that < 0 means that skb was stolen (or free'd).
All other return values continue skb processing.
nf_hook_slow returns 3 different return value types:
A) a (negative) errno value: the skb was dropped (NF_DROP, e.g.
by iptables '-j DROP' rule).
B) 0. The skb was stolen by the hook or queued to userspace.
C) 1. all hooks returned NF_ACCEPT so the caller should invoke
the okfn so packet processing can continue.
nft ingress facility currently doesn't have the 'okfn' that
the NF_HOOK() macros use; there is no nfqueue support either.
So 1 means that nf_hook_ingress() caller should go on processing the skb.
In order to allow use of NF_STOLEN from ingress we need to translate
this to an errno number, else we'd crash because we continue with
already-free'd (or about to be free-d) skb.
The errno value isn't checked, it's just important that it's less than 0,
so return -1.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
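The return-value translation described in the ingress commit above can be sketched in user-space C as follows. demo_hook_ingress() is a hypothetical stand-in, not the kernel's nf_hook_ingress(); it only models the three return classes listed in the message.

```c
#include <assert.h>
#include <errno.h>

/*
 * hook_slow_ret models the three nf_hook_slow() result classes:
 *   < 0  the skb was dropped (negative errno)
 *     0  the skb was stolen by a hook
 *     1  all hooks accepted, the caller continues processing
 * The ingress caller only distinguishes "< 0" from everything else,
 * so the "stolen" case has to be mapped to a negative value.
 */
static int demo_hook_ingress(int hook_slow_ret)
{
    assert(hook_slow_ret <= 1);
    if (hook_slow_ret == 0)
        return -1;  /* the value is not inspected, only its sign */
    return hook_slow_ret;
}
```

With this mapping a caller that checks for a negative result stops touching the skb in both the dropped and the stolen case.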
Gao Feng [Fri, 25 Nov 2016 04:32:07 +0000 (12:32 +0800)]
netfilter: xt_multiport: Fix wrong unmatch result with multiple ports
I lost one test case in the last commit for xt_multiport.
For example, the rule is "-m multiport --dports 22,80,443".
When the first port is unmatched and the second is matched, the current code
does not return the right result.
It would return false directly when the first port is unmatched.
Fixes: dd2602d00f80 ("netfilter: xt_multiport: Use switch case instead of multiple condition checks")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
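The intended matching behaviour for a rule such as "--dports 22,80,443" can be sketched as a simple loop. This is a simplified illustration that ignores port ranges and source ports; demo_ports_match() is a hypothetical helper, not the xt_multiport code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if dport equals any of the n configured ports. */
static bool demo_ports_match(const uint16_t *ports, size_t n, uint16_t dport)
{
    assert(ports || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ports[i] == dport)
            return true;
        /* A mismatch on this entry must not end the search. */
    }
    return false;  /* only after every entry was checked */
}
```

The bug described above corresponds to returning false inside the loop on the first mismatch instead of after the loop.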
Pablo Neira Ayuso [Thu, 24 Nov 2016 11:04:55 +0000 (12:04 +0100)]
netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields
This patch adds a new flag that signals the kernel to update layer 4
checksum if the packet field belongs to the layer 4 pseudoheader. This
implicitly provides stateless NAT 1:1 that is useful under very specific
usecases.
Since rules mangling layer 3 fields that are part of the pseudoheader
may potentially convey any layer 4 packet, we have to deal with the
layer 4 checksum adjustment using protocol specific code.
This patch adds support for TCP, UDP and ICMPv6, since they include the
pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we
can skip it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
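As background for the checksum adjustment, the following user-space sketch shows the standard incremental update of a 16-bit one's complement checksum when one 16-bit word changes (the RFC 1624 formula HC' = ~(~HC + ~m + m')). demo_csum_replace16() is a hypothetical name; the kernel uses its own checksum helpers, which this does not reproduce.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Update checksum `check` after a covered 16-bit word changed from
 * old_val to new_val, without re-summing the whole packet.
 */
static uint16_t demo_csum_replace16(uint16_t check, uint16_t old_val,
                                    uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_val;
    sum += new_val;
    /* Fold the carries back in (one's complement addition). */
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    assert(sum <= 0xffff);
    return (uint16_t)~sum;
}
```

When a rule rewrites an address that is part of the pseudoheader, the same kind of update has to be applied to the TCP, UDP or ICMPv6 checksum, once per changed 16-bit word.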
Liping Zhang [Wed, 23 Nov 2016 14:12:21 +0000 (22:12 +0800)]
netfilter: nft_fib_ipv4: initialize *dest to zero
Otherwise, if the fib lookup fails, *dest will be filled with a garbage value,
so reverse path filtering will not work properly:
# nft add rule x prerouting fib saddr oif eq 0 drop
Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang [Wed, 23 Nov 2016 14:12:20 +0000 (22:12 +0800)]
netfilter: nft_fib: convert htonl to ntohl properly
Actually ntohl and htonl are identical, so this doesn't affect
anything, but it is conceptually wrong.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 22 Nov 2016 13:44:19 +0000 (14:44 +0100)]
netfilter: x_tables: pack percpu counter allocations
Instead of allocating each xt_counter individually, allocate 4k chunks
and then use these for counter allocation requests.
This should speed up rule evaluation by increasing data locality,
also speeds up ruleset loading because we reduce calls to the percpu
allocator.
As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
arches with 64k page size.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
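The chunked allocation idea can be sketched in user space as a bump allocator that carves counters out of fixed 4096-byte blocks. All names here (demo_counter_pair, demo_alloc_state, demo_counter_alloc) are hypothetical, calloc() stands in for the percpu allocator, and freeing of chunks is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_CHUNK_SIZE 4096  /* fixed size, deliberately not the page size */

/* Stand-in for a packet/byte counter pair (16 bytes). */
struct demo_counter_pair {
    unsigned long long pcnt;
    unsigned long long bcnt;
};

struct demo_alloc_state {
    char *chunk;  /* current chunk, NULL before the first allocation */
    size_t off;   /* next free offset inside the chunk */
};

/* Hand out the next counter, starting a new chunk only when needed. */
static struct demo_counter_pair *demo_counter_alloc(struct demo_alloc_state *st)
{
    struct demo_counter_pair *p;

    assert(st);
    if (!st->chunk || st->off + sizeof(*p) > DEMO_CHUNK_SIZE) {
        st->chunk = calloc(1, DEMO_CHUNK_SIZE);  /* one allocator call per chunk */
        if (!st->chunk)
            return NULL;
        st->off = 0;
    }
    p = (struct demo_counter_pair *)(st->chunk + st->off);
    st->off += sizeof(*p);
    return p;
}
```

Consecutive rules end up with adjacent counters, which is where the locality gain mentioned above comes from, and the underlying allocator is called once per chunk rather than once per rule.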
Florian Westphal [Tue, 22 Nov 2016 13:44:18 +0000 (14:44 +0100)]
netfilter: x_tables: pass xt_counters struct to counter allocator
Keeps some noise away from a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 22 Nov 2016 13:44:17 +0000 (14:44 +0100)]
netfilter: x_tables: pass xt_counters struct instead of packet counter
On SMP we overload the packet counter (unsigned long) to contain
percpu offset. Hide this from callers and pass xt_counters address
instead.
Preparation patch to allocate the percpu counters in page-sized batch
chunks.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Aaron Conole [Tue, 15 Nov 2016 22:48:46 +0000 (17:48 -0500)]
netfilter: convert while loops to for loops
This is to facilitate converting from a singly-linked list to an array
of elements.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Aaron Conole [Tue, 15 Nov 2016 22:48:45 +0000 (17:48 -0500)]
netfilter: decouple nf_hook_entry and nf_hook_ops
During nfhook traversal we only need a very small subset of
nf_hook_ops members.
We need:
- next element
- hook function to call
- hook function priv argument
Bridge netfilter also needs 'thresh'; can be obtained via ->orig_ops.
nf_hook_entry struct is now 32 bytes on x86_64.
A followup patch will turn the run-time list into an array that only
stores hook functions plus their priv arguments, eliminating the ->next
element.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
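A rough user-space model of the slimmed-down entry, with the four members listed above, is sketched below. The types are hypothetical stand-ins (the hook signature is simplified to two arguments) and this is not the kernel definition; on a 64-bit target four pointers add up to the 32 bytes mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ACCEPT 1u

struct demo_hook_ops;  /* opaque stand-in for the full registration data */

typedef unsigned int demo_hookfn(void *priv, void *skb);

/* Only what traversal needs, plus a back pointer for the rare extras. */
struct demo_hook_entry {
    struct demo_hook_entry *next;
    demo_hookfn *hook;
    void *priv;
    const struct demo_hook_ops *orig_ops;
};

/* Walk the list with a for loop; stop at the first non-accept verdict. */
static unsigned int demo_run_hooks(const struct demo_hook_entry *e, void *skb)
{
    for (; e; e = e->next) {
        unsigned int verdict;

        assert(e->hook);
        verdict = e->hook(e->priv, skb);
        if (verdict != DEMO_ACCEPT)
            return verdict;
    }
    return DEMO_ACCEPT;
}
```

Writing the traversal as a for loop over entries keeps the loop body unchanged if the list is later replaced by an array, which is the direction the related patches describe.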
Aaron Conole [Tue, 15 Nov 2016 22:48:44 +0000 (17:48 -0500)]
netfilter: introduce accessor functions for hook entries
This allows easier future refactoring.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 15 Nov 2016 20:36:45 +0000 (21:36 +0100)]
netfilter: defrag: only register defrag functionality if needed
nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply 'calls'
this empty function to create a phony module dependency -- modprobe will
then load the defrag module too.
This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
registration until the functionality is requested within a network namespace
instead of module load time for all namespaces.
Hooks are only un-registered on module unload or when a namespace that used
such defrag functionality exits.
We have to use struct net for this as the register hooks can be called
before netns initialization here from the ipv4/ipv6 conntrack module
init path.
There is no unregister functionality support, defrag will always be
active once it was requested inside a net namespace.
The reason is that defrag has impact on nft and iptables rulesets
(without defrag we might see fragments).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 15 Nov 2016 20:36:44 +0000 (21:36 +0100)]
netfilter: conntrack: add nf_conntrack_default_on sysctl
This switch (default on) can be used to disable automatic registration
of connection tracking functionality in newly created network
namespaces.
This means that when net namespace goes down (or the tracker protocol
module is unloaded) we *might* have to unregister the hooks.
We can either add another per-netns variable that tells if
the hooks got registered by default, or, alternatively, just call
the protocol _put() function and have the callee deal with a possible
'extra' put() operation that doesn't pair with a get() one.
This uses the latter approach, i.e. a put() without a get has no effect.
Conntrack is still enabled automatically regardless of the new sysctl
setting if the new net namespace requires connection tracking, e.g. when
NAT rules are created.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 15 Nov 2016 20:36:43 +0000 (21:36 +0100)]
netfilter: conntrack: register hooks in netns when needed by ruleset
This makes use of nf_ct_netns_get/put added in previous patch.
We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6
then implement use-count to track how many users (nft or xtables modules)
have a dependency on ipv4 and/or ipv6 connection tracking functionality.
When count reaches zero, the hooks are unregistered.
This delays activation of connection tracking inside a namespace until
a stateful firewall rule or NAT rule gets added.
This patch breaks backwards compatibility in the sense that connection
tracking won't be active anymore when the protocol tracker module is
loaded. This breaks e.g. setups that use ctnetlink for flow accounting and
the like, without any '-m conntrack' packet filter rules.
A followup patch restores the old behaviour and makes the new delayed scheme
optional via sysctl.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
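The per-namespace use count can be modelled in a few lines of user-space C. The names demo_netns_ct, demo_ct_netns_get() and demo_ct_netns_put() are hypothetical; the sketch also includes the tolerant put() described in the nf_conntrack_default_on commit, where a put() without a matching get() has no effect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-namespace state for one l3proto tracker. */
struct demo_netns_ct {
    unsigned int users;     /* rules/modules depending on conntrack */
    bool hooks_registered;  /* stands in for the real hook registration */
};

/* First user in this namespace registers the hooks. */
static void demo_ct_netns_get(struct demo_netns_ct *ns)
{
    assert(ns);
    if (ns->users++ == 0)
        ns->hooks_registered = true;
}

/* Last user unregisters; an unpaired put() is silently ignored. */
static void demo_ct_netns_put(struct demo_netns_ct *ns)
{
    assert(ns);
    if (ns->users == 0)
        return;
    if (--ns->users == 0)
        ns->hooks_registered = false;
}
```

In this model a namespace that never adds a stateful or NAT rule never has its hooks registered.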
Florian Westphal [Tue, 15 Nov 2016 20:36:42 +0000 (21:36 +0100)]
netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressions
so that conntrack core will add the needed hooks in this namespace.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 15 Nov 2016 20:36:41 +0000 (21:36 +0100)]
netfilter: nat: add dependencies on conntrack module
MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the
conntrack module.
However, since the conntrack hooks are now registered in a lazy fashion
(i.e., only when needed) a symbol reference is not enough.
Thus, when something is added to a nat table, make sure that it will see
packets by calling nf_ct_netns_get() which will register the conntrack
hooks in the current netns.
An alternative would be to add these dependencies to the NAT table.
However, that has problems when using non-modular builds -- we might
register e.g. ipv6 conntrack before its initcall has run, leading to NULL
deref crashes since its per-netns storage has not yet been allocated.
Adding the dependency in the modules instead has the advantage that nat
table also does not register its hooks until rules are added.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 15 Nov 2016 20:36:40 +0000 (21:36 +0100)]
netfilter: add and use nf_ct_netns_get/put
currently aliased to try_module_get/_put.
Will be changed in next patch when we add functions to make use of ->net
argument to store usercount per l3proto tracker.
This is needed to avoid registering the conntrack hooks in all netns and
later only enable connection tracking in those that need conntrack.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Tue, 15 Nov 2016 20:36:39 +0000 (21:36 +0100)]
netfilter: conntrack: remove unused init_net hook
since adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code")
the only user (ipv4 tracker) sets this to an empty stub function.
After this change nf_ct_l3proto_pernet_register() is also empty,
but this will change in a followup patch to add conditional register
of the hooks.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Tue, 15 Nov 2016 14:08:27 +0000 (15:08 +0100)]
netfilter: conntrack: built-in support for UDPlite
CONFIG_NF_CT_PROTO_UDPLITE is no longer a tristate. When set to y,
connection tracking support for the UDPlite protocol is built into
nf_conntrack.ko.
footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \
net/ipv4/netfilter/nf_conntrack_ipv4.ko \
net/ipv6/netfilter/nf_conntrack_ipv6.ko
(builtin)|| udplite| ipv4   | ipv6   | nf_conntrack
---------++--------+--------+--------+--------------
none     || 432538 | 828755 | 828676 | 6141434
UDPlite  || -      | 829649 | 829362 | 6498204
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Tue, 15 Nov 2016 14:08:26 +0000 (15:08 +0100)]
netfilter: conntrack: built-in support for SCTP
CONFIG_NF_CT_PROTO_SCTP is no longer a tristate. When set to y, connection
tracking support for the SCTP protocol is built into nf_conntrack.ko.
footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \
net/ipv4/netfilter/nf_conntrack_ipv4.ko \
net/ipv6/netfilter/nf_conntrack_ipv6.ko
(builtin)|| sctp   | ipv4   | ipv6   | nf_conntrack
---------++--------+--------+--------+--------------
none     || 498243 | 828755 | 828676 | 6141434
SCTP     || -      | 829254 | 829175 | 6547872
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Tue, 15 Nov 2016 14:08:25 +0000 (15:08 +0100)]
netfilter: conntrack: built-in support for DCCP
CONFIG_NF_CT_PROTO_DCCP is no longer a tristate. When set to y, connection
tracking support for the DCCP protocol is built into nf_conntrack.ko.
footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
net/ipv4/netfilter/nf_conntrack_ipv4.ko \
net/ipv6/netfilter/nf_conntrack_ipv6.ko
(builtin)|| dccp   | ipv4   | ipv6   | nf_conntrack
---------++--------+--------+--------+--------------
none     || 469140 | 828755 | 828676 | 6141434
DCCP     || -      | 830566 | 829935 | 6533526
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Tue, 15 Nov 2016 14:08:24 +0000 (15:08 +0100)]
netfilter: nf_conntrack_tuple_common.h: fix #include
To allow usage of enum ip_conntrack_dir in include/net/netns/conntrack.h,
this patch encloses #include <linux/netfilter.h> in a #ifndef __KERNEL__
directive, so that compiler errors caused by unwanted inclusion of
include/linux/netfilter.h are avoided.
In addition, #include <linux/netfilter/nf_conntrack_common.h> line has
been added to resolve correctly CTINFO2DIR macro.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 4 Dec 2016 19:46:16 +0000 (20:46 +0100)]
Merge tag 'ipvs-for-v4.10' of https://git./linux/kernel/git/horms/ipvs-next
Simon Horman says:
====================
IPVS Updates for v4.10
please consider these enhancements to the IPVS for v4.10.
* Decrement the IP ttl in all the modes in order to prevent infinite
route loops. Thanks to Dwip Banerjee.
* Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang [Mon, 14 Nov 2016 14:41:08 +0000 (22:41 +0800)]
netfilter: nfnetlink_log: add "nf-logger-5-1" module alias name
So we can autoload nfnetlink_log.ko when the user adds an nft log
group X rule in the netdev family.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Liping Zhang [Mon, 14 Nov 2016 14:39:25 +0000 (22:39 +0800)]
netfilter: nf_log: do not assume ethernet header in netdev family
In netdev family, we will handle non ethernet packets, so using
eth_hdr(skb)->h_proto is incorrect.
Meanwhile, we can use socket(AF_PACKET...) to send packets, so
skb->protocol is not always set in bridge family.
Add an extra parameter into nf_log_l2packet to solve this issue.
Fixes: 1fddf4bad0ac ("netfilter: nf_log: add packet logging for netdev family")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Thu, 20 Oct 2016 16:33:03 +0000 (18:33 +0200)]
netfilter: built-in NAT support for UDPlite
CONFIG_NF_NAT_PROTO_UDPLITE is no longer a tristate. When set to y, NAT
support for the UDPlite protocol is built into nf_nat.ko.
footprint test:
(nf_nat_proto_)           |udplite || nf_nat
--------------------------+--------++--------
no builtin                | 408048 || 2241312
UDPLITE builtin           | -      || 2577256
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Thu, 20 Oct 2016 16:33:02 +0000 (18:33 +0200)]
netfilter: built-in NAT support for SCTP
CONFIG_NF_NAT_PROTO_SCTP is no longer a tristate. When set to y, NAT
support for the SCTP protocol is built into nf_nat.ko.
footprint test:
(nf_nat_proto_)           | sctp   || nf_nat
--------------------------+--------++--------
no builtin                | 428344 || 2241312
SCTP builtin              | -      || 2597032
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Davide Caratti [Thu, 20 Oct 2016 16:33:01 +0000 (18:33 +0200)]
netfilter: built-in NAT support for DCCP
CONFIG_NF_NAT_PROTO_DCCP is no longer a tristate. When set to y, NAT
support for the DCCP protocol is built into nf_nat.ko.
footprint test:
(nf_nat_proto_)           | dccp   || nf_nat
--------------------------+--------++--------
no builtin                | 409800 || 2241312
DCCP builtin              | -      || 2578968
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Arturo Borrero Gonzalez [Tue, 18 Oct 2016 12:02:29 +0000 (14:02 +0200)]
netfilter: update Arturo Borrero Gonzalez email address
The email address has changed, let's update the copyright statements.
Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Erik Nordmark [Fri, 2 Dec 2016 22:00:08 +0000 (14:00 -0800)]
ipv6 addrconf: Implemented enhanced DAD (RFC7527)
Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.
Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
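The nonce check at the heart of Enhanced DAD can be sketched as a plain comparison. This is an illustration only: demo_ns_is_own_probe() is a hypothetical helper and the 6-byte nonce length is an assumption, not taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NONCE_LEN 6  /* assumed nonce size in bytes */

/*
 * Return true when a received NS carries the nonce we put into our own
 * DAD probe, i.e. the frame was looped back and must be ignored rather
 * than treated as an address conflict.
 */
static bool demo_ns_is_own_probe(const uint8_t *sent, const uint8_t *rcvd)
{
    assert(sent && rcvd);
    return memcmp(sent, rcvd, DEMO_NONCE_LEN) == 0;
}
```

An NS for the same tentative address with a different nonce still counts as a genuine duplicate.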
David S. Miller [Sun, 4 Dec 2016 04:18:39 +0000 (23:18 -0500)]
Merge branch 'mv88e6390-batch-three'
Andrew Lunn says:
====================
mv88e6390 batch 3
More patches to support the MV88e6390. This is mostly refactoring
existing code and adding implementations for the mv88e6390. This
patchset sets which reserved frames are sent to the CPU, sets the size of
jumbo frames that will be accepted, turns off egress rate limiting, and
configures pause frames.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:45:20 +0000 (04:45 +0100)]
net: dsa: mv88e6xxx: Implement mv88e6390 pause control
The mv88e6390 has a number of flow control registers accessed via the
Flow Control register. Use these to set the pause control.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:45:19 +0000 (04:45 +0100)]
net: dsa: mv88e6xxx: Refactor pause configuration
The mv88e6390 has a different mechanism for configuring pause.
Refactor the code into an ops function, and for the moment, don't add
any mv88e6390 code yet.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:45:18 +0000 (04:45 +0100)]
net: dsa: mv88e6xxx: Refactor egress rate limiting
There are two different rate limiting configurations, depending on the
switch generation. Refactor this into ops.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:45:17 +0000 (04:45 +0100)]
net: dsa: mv88e6xxx: Refactor setting of jumbo frames
Some switches support jumbo frames. Refactor this code into operations
in the ops structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:45:16 +0000 (04:45 +0100)]
net: dsa: mv88e6xxx: Reserved Management frames to CPU
Older devices have a couple of registers in global2. The mv88e6390
family has a single register in global1 behind which hides similar
configuration. Implement an op for this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 4 Dec 2016 04:15:01 +0000 (23:15 -0500)]
Merge branch 'mv88e6390-batch-two'
Andrew Lunn says:
====================
MV88E6390 batch two
This is the second batch of patches adding support for the
MV88e6390. They are not sufficient to make it work properly.
The mv88e6390 has a much expanded set of priority maps. Refactor the
existing code, and implement basic support for the new device.
Similarly, the monitor control register has been reworked.
The mv88e6390 has something odd in its EDSA tagging implementation,
which means it is not possible to use it. So we need to use DSA
tagging. This is the first device with EDSA support where we need to
use DSA, and the code does not support this. So two patches refactor
the existing code. The two different register definitions are
separated out, and using DSA on an EDSA capable device is added.
v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
Split monitor_ctrl into a number of separate ops.
Remove 6390 code which is management, used in a later patch
s/EGREES/EGRESS/.
Broke up setup_port_dsa() and set_port_dsa() into a number of ops
v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:35:19 +0000 (04:35 +0100)]
net: dsa: mv88e6xxx: Refactor CPU and DSA port setup
Older chips only support DSA tagging. Newer chips have both DSA and
EDSA tagging. Refactor the code by adding port functions for setting the
frame mode, egress mode, and whether to forward unknown frames.
This results in the helper mv88e6xxx_6065_family() becoming unused, so
remove it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:35:18 +0000 (04:35 +0100)]
net: dsa: mv88e6xxx: Move the tagging protocol into info
Older chips support a single tagging protocol, DSA. New chips support
both DSA and EDSA, an enhanced version. Having both as an option
changes the register layouts. Up until now, it has been assumed that
if EDSA is supported, it will be used. Hence the register layout has
been determined by which protocol should be used. However, mv88e6390
has a different implementation of EDSA, which requires us to use
the DSA tagging. Hence separate the selection of the protocol from the
register layout.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:35:17 +0000 (04:35 +0100)]
net: dsa: mv88e6xxx: Monitor and Management tables
The mv88e6390 changes the monitor control register into the Monitor
and Management control, which is an indirection register to various
registers.
Add ops to set the CPU port and the ingress/egress port for both
register layouts, to global1.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 3 Dec 2016 03:35:16 +0000 (04:35 +0100)]
net: dsa: mv88e6xxx: Implement mv88e6390 tag remap
The mv88e6390 does not have the two registers to set the frame
priority map. Instead it has an indirection register for setting a
number of different priority maps. Refactor the old code into a
function, implement the mv88e6390 version, and use an op to call the
right one.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 4 Dec 2016 00:29:37 +0000 (19:29 -0500)]
Merge branch 'fib-notifier-event-replay'
Jiri Pirko says:
====================
ipv4: fib: Replay events when registering FIB notifier
Ido says:
In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.
The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by dumping the FIB tables and
replaying notifications previously sent in the chain for the registered
notification block.
The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first eight patches.
The ninth patch adds a change sequence counter to ensure the integrity
of the FIB dump. The last patch adds the dump itself to the FIB chain
registration function and modifies existing listeners to pass a callback
to be executed in case dump was inconsistent.
---
v3->v4:
- Register the notification block after the dump and protect it using
the change sequence counter (Hannes Frederic Sowa).
- Since we now integrate the dump into the registration function, drop
the sysctl to set maximum number of retries and instead set it to a
fixed number. Let's see if it's really a problem before adding something
we can never remove.
- For the same reason, dump FIB tables for all net namespaces.
- Add a comment regarding guarantees provided by mutex semantics.
v2->v3:
- Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
between the dump process and other processes changing the routing
tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.
v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
(David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
to the FIB notification chain prior to ports creation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:07 +0000 (16:45 +0100)]
ipv4: fib: Replay events when registering FIB notifier
Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.
However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.
Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.
The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends an ENTRY_ADD
notification following ENTRY_DEL generated by another process holding
RTNL.
Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.
The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
In case current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
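The consistency check around the dump can be sketched as a bounded retry loop. Everything here is a hypothetical user-space model: struct demo_fib, demo_dump() and demo_register_notifier() are illustrative names, the retry limit of 5 is an assumed value, and racing_changes is a knob that simulates table changes landing while a dump is in progress.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_RETRIES 5  /* assumed fixed retry limit */

struct demo_fib {
    unsigned int seq;             /* bumped on every table change */
    unsigned int racing_changes;  /* simulated changes hitting upcoming dumps */
    unsigned int dumps;           /* number of dumps performed */
    unsigned int flushes;         /* times the "inconsistent" callback ran */
};

/* Replay the table to the listener; a simulated change may race with it. */
static void demo_dump(struct demo_fib *fib)
{
    fib->dumps++;
    if (fib->racing_changes) {
        fib->racing_changes--;
        fib->seq++;
    }
}

/* Dump until the sequence counter is unchanged across one whole dump. */
static bool demo_register_notifier(struct demo_fib *fib)
{
    assert(fib);
    for (int attempt = 0; attempt < DEMO_MAX_RETRIES; attempt++) {
        unsigned int before = fib->seq;  /* the kernel reads this under RTNL */

        demo_dump(fib);
        if (fib->seq == before)
            return true;   /* consistent snapshot, registration succeeds */
        fib->flushes++;    /* caller's callback discards the partial state */
    }
    return false;          /* give up instead of looping indefinitely */
}
```

The fixed bound trades completeness for predictability: a busy table can make registration fail, but a caller never spins for an unbounded time.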
Ido Schimmel [Sat, 3 Dec 2016 15:45:06 +0000 (16:45 +0100)]
ipv4: fib: Allow for consistent FIB dumping
The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.
Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:05 +0000 (16:45 +0100)]
ipv4: fib: Convert FIB notification chain to be atomic
In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.
Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:04 +0000 (16:45 +0100)]
rocker: Register FIB notifier before creating ports
We can miss FIB notifications sent between the time the ports were
created and the FIB notification block registered.
Instead of receiving these notifications only when they are replayed for
the FIB notification block during registration, just register the
notification block before the ports are created.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:03 +0000 (16:45 +0100)]
rocker: Implement FIB offload in deferred work
Convert rocker to offload FIBs in deferred work in a similar fashion to
mlxsw, which was converted in the previous commits.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:02 +0000 (16:45 +0100)]
rocker: Create an ordered workqueue for FIB offload
As explained in the previous commits, we need to process FIB entries
addition / deletion events in FIFO order or otherwise we can have a
mismatch between the kernel's FIB table and the device's.
Create an ordered workqueue for rocker to which these work items will be
submitted.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:01 +0000 (16:45 +0100)]
mlxsw: spectrum_router: Implement FIB offload in deferred work
FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in RCU critical section, so we can no
longer sleep.
Instead, defer the operation to process context using deferred work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.
Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.
The work items are submitted to an ordered workqueue to prevent a
mismatch between the kernel's FIB table and the device's.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:45:00 +0000 (16:45 +0100)]
mlxsw: core: Create an ordered workqueue for FIB offload
We're going to start processing FIB entries addition / deletion events
in deferred work. These work items must be processed in the order they
were submitted or otherwise we can have differences between the kernel's
FIB table and the device's.
Solve this by creating an ordered workqueue to which these work items
will be submitted. Note that we can't simply convert the current
workqueue to be ordered, as EMAD re-transmissions are also processed in
deferred work.
Later on, we can migrate other work items to this workqueue, such as FDB
notification processing and nexthop resolution, since they all take the
same lock anyway.
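Why submission order matters can be shown with a toy FIFO. The sketch below is not the kernel workqueue API; struct work_queue, queue_work_item() and drain_in_order() are hypothetical names, and the "device" is reduced to a single flag saying whether a route is offloaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fib_event { ENTRY_ADD, ENTRY_DEL };	/* illustrative event types */

#define QUEUE_LEN 8

/* Toy FIFO standing in for an ordered workqueue (one item at a time). */
struct work_queue {
	enum fib_event items[QUEUE_LEN];
	size_t head, tail;
};

/* Append an event; returns false when the queue is full. */
static bool queue_work_item(struct work_queue *wq, enum fib_event ev)
{
	assert(wq->tail - wq->head <= QUEUE_LEN);
	if (wq->tail - wq->head == QUEUE_LEN)
		return false;
	wq->items[wq->tail++ % QUEUE_LEN] = ev;
	return true;
}

/* Drain in submission order and return whether the route ends up offloaded. */
static bool drain_in_order(struct work_queue *wq, bool offloaded)
{
	while (wq->head != wq->tail) {
		enum fib_event ev = wq->items[wq->head++ % QUEUE_LEN];

		offloaded = (ev == ENTRY_ADD);
	}
	return offloaded;
}
```

Queuing ADD then DEL leaves the route absent, while DEL then ADD leaves it present, so any reordering of the two items would leave the device disagreeing with the kernel.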
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:44:59 +0000 (16:44 +0100)]
ipv4: fib: Add fib_info_hold() helper
As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().
Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().
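A minimal sketch of the hold/put symmetry, assuming a plain integer counter instead of the kernel's atomic reference count. struct info, info_hold(), info_put() and info_free() are hypothetical stand-ins, not the real fib_info helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy object; the kernel's struct fib_info has many more fields. */
struct info {
	int refcnt;
	bool freed;
};

static void info_free(struct info *fi)
{
	fi->freed = true;	/* real code would release the memory here */
}

/* Take a reference so the object outlives the queued work. */
static void info_hold(struct info *fi)
{
	assert(fi->refcnt > 0);	/* caller must already own a reference */
	fi->refcnt++;
}

/* Drop a reference and free the object when the last one goes away. */
static void info_put(struct info *fi)
{
	assert(fi->refcnt > 0);
	if (--fi->refcnt == 0)
		info_free(fi);
}
```

A driver would call the hold helper before queuing the work item and the put helper once the work has run, so the pair always appears together in the code.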
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sat, 3 Dec 2016 15:44:58 +0000 (16:44 +0100)]
ipv4: fib: Export free_fib_info()
The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
deferred work, as hardware operations entail sleeping.
However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().
Export free_fib_info() so that modules will be able to invoke
fib_info_put().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Sat, 3 Dec 2016 18:36:01 +0000 (10:36 -0800)]
act_mirred: fix a typo in get_dev
Fixes: 255cb30425c0 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 4 Dec 2016 00:10:48 +0000 (19:10 -0500)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-12-02
This series contains updates to i40e and i40evf only.
Alex provides changes so that we are much more robust about defining what
we can and cannot offload in i40e and i40evf by doing additional checks
other than L4 tunnel header length.
Jake provides several fixes/changes, first cleaning up a label that is
unnecessary, as well as cleaning up the use of a "magic number". He
clarifies the code by separating the global private flags and the regular
private flags per interface into two arrays, so that future additions will
not produce duplication and buggy code. He also adds checks to protect
against NULL values for msix_entries and q_vectors pointers.
Michal adds Clause22 method for accessing registers for some external
PHYs.
Piotr adds additional protocol support for the admin queue discover
capabilities function.
Tushar Dave fixes a panic seen on SPARC, where writel() should not be
used to write directly to a memory address but only to a memory mapped
I/O address otherwise it causes data access exceptions.
Joe Perches separates out a section of code into its own function, to
help reduce i40evf_reset_task() a bit.
Alan fixes an issue by checking for NULL before dereferencing msix_entries
and returning early in the case where it is NULL within the i40evf_close()
code path.
Henry provides code cleanup to remove unreachable and redundant sections
of code. Fixed up an issue where new NICs were not identifying "unknown
PHYs" correctly.
Harshitha fixes an issue where the ethtool "Supported Link" modes list
backplane interfaces on X722 devices for 10 GbE with SFP+ and Cortina
retimer, where these interfaces should not be visible to the user since
they cannot use them.
Carolyn changes an X722 informational message so that it only appears
when extra messages are desired.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Sat, 3 Dec 2016 22:46:22 +0000 (14:46 -0800)]
tcp: fix the missing avr32 SOF_TIMESTAMPING_OPT_STATS
The commit of SOF_TIMESTAMPING_OPT_STATS didn't include the
new header for avr32, causing build to break. The patch fixes it.
Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 2 Dec 2016 16:35:49 +0000 (17:35 +0100)]
udp: be less conservative with sock rmem accounting
Before commit 850cbaddb52d ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper this causes a performance regression
for some (small) values of rcvbuf.
This commit is intended to fix the regression by restoring the old
handling of the rcvbuf limit.
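The two admission rules described above can be compared side by side. This is a sketch of the arithmetic only, with hypothetical function names and plain ints in place of the socket fields; it does not reproduce the actual udp receive path.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Strict rule (the behaviour described for the period after 850cbaddb52d):
 * the limit may be exceeded only when nothing is queued yet.
 */
static bool accept_strict(int rmem_before, int truesize, int rcvbuf)
{
	assert(rmem_before >= 0 && truesize > 0);
	return rmem_before == 0 || rmem_before + truesize <= rcvbuf;
}

/*
 * Relaxed rule (the older behaviour this commit restores): the total may
 * exceed rcvbuf by up to the whole truesize of the packet being queued.
 */
static bool accept_relaxed(int rmem_before, int truesize, int rcvbuf)
{
	int rmem_after;

	assert(rmem_before >= 0 && truesize > 0);
	rmem_after = rmem_before + truesize;
	return rmem_after <= rcvbuf + truesize;
}
```

With a small rcvbuf, a packet whose truesize is a large fraction of the buffer is rejected by the strict rule as soon as anything is queued, which is the kind of case where the two rules diverge.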
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 2 Dec 2016 16:11:00 +0000 (08:11 -0800)]
net_sched: gen_estimator: account for timer drifts
Under heavy stress, timers used in estimators tend to slowly be delayed
by a few jiffies, leading to inaccuracies.
Let's remember the last scheduled jiffies value so that we get more
precise estimations, without having to add a multiply/divide in the loop
to account for the drifts.
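The effect of remembering the scheduled time can be sketched with a small simulation. The helper names and the fixed lateness are assumptions made for illustration; the real estimator code is organized differently.

```c
#include <assert.h>

/* Naive rearm: every late firing pushes all later deadlines back. */
static unsigned long rearm_from_now(unsigned long now, unsigned long interval)
{
	return now + interval;
}

/* Drift-free rearm: advance from the previously scheduled deadline. */
static unsigned long rearm_from_schedule(unsigned long scheduled,
					 unsigned long interval)
{
	return scheduled + interval;
}

/* Deadline reached after `rounds` firings that each run `late` jiffies late. */
static unsigned long simulate(int rounds, unsigned long interval,
			      unsigned long late, int keep_schedule)
{
	unsigned long deadline = interval;

	assert(rounds >= 0 && interval > 0);
	for (int i = 0; i < rounds; i++) {
		unsigned long now = deadline + late;	/* timer fired late */

		if (keep_schedule)
			deadline = rearm_from_schedule(deadline, interval);
		else
			deadline = rearm_from_now(now, interval);
	}
	return deadline;
}
```

In the naive variant the lateness accumulates with every round, while the schedule-based variant keeps the period exact using only an addition.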
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 2 Dec 2016 15:51:33 +0000 (15:51 +0000)]
sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead
Logically, EFX_BUG_ON_PARANOID can never be correct. For, BUG_ON should
only be used if it is not possible to continue without potential harm;
and since the non-DEBUG driver will continue regardless (as the BUG_ON is
compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
or the newly defined EFX_WARN_ON_ONCE_PARANOID.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 3 Dec 2016 21:08:01 +0000 (16:08 -0500)]
Merge branch 'samples-bpf-automated-cgroup-tests'
Sargun Dhillon says:
====================
samples, bpf: Refactor; Add automated tests for cgroups
These two patches are around refactoring out some old, reusable code from the
existing test_current_task_under_cgroup_user test, and adding a new, automated
test.
There is some generic cgroupsv2 setup & cleanup code, given that most
environments still don't have it set up by default. With this code, we're
able to pretty easily add an automated test for future cgroupsv2
functionality.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sargun Dhillon [Fri, 2 Dec 2016 10:42:32 +0000 (02:42 -0800)]
samples, bpf: Add automated test for cgroup filter attachments
This patch adds the sample program test_cgrp2_attach2. This program is
similar to test_cgrp2_attach, but it performs automated testing of the
cgroupv2 BPF attached filters. It runs the following checks:
* Simple filter attachment
* Application of filters to child cgroups
* Overriding filters on child cgroups
* Checking that this still works when the parent filter is removed
The filters that are used here are simply allow all / deny all filters, so
it isn't checking the actual functionality of the filters, but rather
the behaviour around detachment / attachment. If net_cls is enabled,
this test will fail.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sargun Dhillon [Fri, 2 Dec 2016 10:42:18 +0000 (02:42 -0800)]
samples, bpf: Refactor test_current_task_under_cgroup - separate out helpers
This patch modifies test_current_task_under_cgroup_user. The test has
several helpers around creating a temporary environment for cgroup
testing, and moving the current task around cgroups. This set of
helpers can then be used in other tests.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Fri, 2 Dec 2016 02:31:12 +0000 (18:31 -0800)]
samples/bpf: silence compiler warnings
silence some of the clang compiler warnings like:
include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension
since they add too much noise to samples/bpf/ build.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan [Fri, 2 Dec 2016 01:21:32 +0000 (04:21 +0300)]
netns: fix net_generic() "id - 1" bloat
net_generic() function is both a) inline and b) used ~600 times.
It has the following code inside
...
ptr = ng->ptr[id - 1];
...
"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.
We also start id'ing from 1 to catch bugs where pernet subsystem id
is not initialized and 0. This is quite a pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.
Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.
Ids are just cookies, their exact values do not matter, so let's start
with 3 on x86_64.
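The overlay idea can be sketched in portable C by letting the fixed header occupy the first few pointer-sized slots of one allocation. Everything here is an assumption for illustration: struct generic_hdr, MIN_ID, generic_alloc() and generic_get() are hypothetical, and the header is copied in with memcpy() rather than through the layout the kernel actually uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed header; the kernel's own layout differs in detail. */
struct generic_hdr {
	unsigned int len;		/* number of slots, header included */
	void *rcu_next;			/* stand-ins for an RCU callback head */
	void (*rcu_func)(void *);
};

/* First usable id: the header occupies slots [0, MIN_ID). */
#define MIN_ID ((sizeof(struct generic_hdr) + sizeof(void *) - 1) / sizeof(void *))

/* Allocate `len` zeroed slots and store the header in the leading ones. */
static void **generic_alloc(unsigned int len)
{
	struct generic_hdr hdr = { .len = len };
	void **slots;

	assert(len > MIN_ID);
	slots = calloc(len, sizeof(void *));
	if (slots)
		memcpy(slots, &hdr, sizeof(hdr));
	return slots;
}

/* Base-0 lookup: no "id - 1" adjustment is needed. */
static void *generic_get(void **slots, unsigned int id)
{
	assert(id >= MIN_ID);
	return slots[id];
}
```

Because ids start at the first slot past the header, the lookup indexes the array directly, which is where the code size saving comes from.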
Code size savings (oh boy): -4.2 KB
As usual, ignore the initial compiler stupidity part of the table.
add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function                                     old     new   delta
tipc_nametbl_insert_publ                    1250    1270     +20
nlmclnt_lookup_host                          686     703     +17
nfsd4_encode_fattr                          5930    5941     +11
nfs_get_client                              1050    1061     +11
register_pernet_operations                   333     342      +9
tcf_mirred_init                              843     849      +6
tcf_bpf_init                                1143    1149      +6
gss_setup_upcall                             990     994      +4
idmap_name_to_id                             432     434      +2
ops_init                                     274     275      +1
nfsd_inject_forget_client                    259     260      +1
nfs4_alloc_client                            612     613      +1
tunnel_key_walker                            164     163      -1
...
tipc_bcbase_select_primary                   392     360     -32
mac80211_hwsim_new_radio                    2808    2767     -41
ipip6_tunnel_ioctl                          2228    2186     -42
tipc_bcast_rcv                               715     672     -43
tipc_link_build_proto_msg                   1140    1089     -51
nfsd4_lock                                  3851    3796     -55
tipc_mon_rcv                                1012     956     -56
Total: Before=156643951, After=156639743, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan [Fri, 2 Dec 2016 01:12:58 +0000 (04:12 +0300)]
netns: add dummy struct inside "struct net_generic"
This is precursor to fixing "[id - 1]" bloat inside net_generic().
Name "s" is chosen to complement name "u" often used for dummy unions.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan [Fri, 2 Dec 2016 01:11:34 +0000 (04:11 +0300)]
netns: publish net_generic correctly
Publishing net_generic pointer is done with a silly mistake: the new array
is published BEFORE setting freshly acquired pernet subsystem pointer.
memcpy
rcu_assign_pointer
kfree_rcu
ng->ptr[id - 1] = data;
This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)
How did it not trigger for so long?
Well, you need a quite specific set of conditions:
*) race window opens once per pernet subsystem addition
(read: modprobe or boot)
*) not every pernet subsystem is eligible (need ->id and ->size)
*) not every pernet subsystem is vulnerable (need incorrect or absence
of ordering of register_pernet_subsys() and actually using net_generic())
*) to hide the bug even more, default is to preallocate 13 pointers which
is actually quite a lot. You need IPv6, netfilter, bridging etc together
loaded to trigger reallocation in the first place. Trimmed down
configs are OK.
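The corrected ordering (copy, set the slot, then publish) can be sketched with C11 atomics standing in for rcu_assign_pointer(). The names struct ptr_array, published, assign_ptr() and lookup_ptr() are hypothetical, and the old array is freed immediately only because the sketch has no concurrent readers; the kernel defers that with kfree_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define NSLOTS 16		/* fixed capacity keeps the sketch short */

struct ptr_array {
	void *ptr[NSLOTS];
};

/* Readers load this pointer; it plays the role of the RCU-protected one. */
static _Atomic(struct ptr_array *) published;

/*
 * Copy the old array, store the new entry, and only then publish.
 * Publishing first and filling the slot afterwards would let a concurrent
 * reader see the new array with an empty slot.
 */
static int assign_ptr(unsigned int id, void *data)
{
	struct ptr_array *old, *fresh;

	assert(id >= 1 && id <= NSLOTS);
	old = atomic_load_explicit(&published, memory_order_acquire);
	fresh = calloc(1, sizeof(*fresh));
	if (!fresh)
		return -1;
	if (old)
		memcpy(fresh, old, sizeof(*fresh));
	fresh->ptr[id - 1] = data;		/* set the slot first */
	atomic_store_explicit(&published, fresh, memory_order_release);
	free(old);	/* unsafe with live readers; deferred in the kernel */
	return 0;
}

/* Reader side: whatever array is visible already has its slots filled. */
static void *lookup_ptr(unsigned int id)
{
	struct ptr_array *cur;

	assert(id >= 1 && id <= NSLOTS);
	cur = atomic_load_explicit(&published, memory_order_acquire);
	return cur ? cur->ptr[id - 1] : NULL;
}
```

The release store pairs with the acquire load, so a reader that observes the new array also observes the slot written before it was published.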
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan [Fri, 2 Dec 2016 00:59:06 +0000 (03:59 +0300)]
netlink: 2-clause nla_ok()
nla_ok() consists of 3 clauses:
1) int rem >= (int)sizeof(struct nlattr)
2) u16 nla_len >= sizeof(struct nlattr)
3) u16 nla_len <= int rem
The statement is that clause (1) is redundant.
What it does is ensure that "rem" is a positive number,
so that in clause (3) positive number will be compared to positive number
with no problems.
However, "u16" fully fits into "int" and integers do not change value
when upcasting even to signed type. Negative integers will be rejected
by clause (3) just fine. Small positive integers will be rejected
by transitivity of comparison operator.
NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
u32(!), so 3 clauses AND A CAST TO INT are necessary.
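The claimed equivalence of the two forms can be checked by brute force over a sample of inputs. The sketch passes nla_len by value, so it only compares the arithmetic; it says nothing about whether the header is readable when "rem" is smaller than the header, since the two-clause form inspects nla_len before that is known. nla_ok3(), nla_ok2() and count_mismatches() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the netlink attribute header: two 16-bit fields. */
struct nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

/* Original three-clause test. */
static bool nla_ok3(uint16_t nla_len, int rem)
{
	return rem >= (int)sizeof(struct nlattr) &&
	       nla_len >= sizeof(struct nlattr) &&
	       nla_len <= rem;
}

/* Two-clause test with clause (1) dropped. */
static bool nla_ok2(uint16_t nla_len, int rem)
{
	return nla_len >= sizeof(struct nlattr) &&
	       nla_len <= rem;
}

/* Count inputs on which the two forms disagree for rem in [rem_lo, rem_hi]. */
static long count_mismatches(int rem_lo, int rem_hi)
{
	long bad = 0;

	assert(rem_lo <= rem_hi);
	for (int len = 0; len <= UINT16_MAX; len++)
		for (int rem = rem_lo; rem <= rem_hi; rem++)
			if (nla_ok3((uint16_t)len, rem) !=
			    nla_ok2((uint16_t)len, rem))
				bad++;
	return bad;
}
```

The u16 length is promoted to int before the comparison with "rem", which is why negative and small positive remainders are still rejected without the first clause.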
Obligatory space savings report: -1.6 KB
$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
function                                     old     new   delta
validate_scan_freqs                          142     155     +13
tcf_em_tree_validate                         867     879     +12
dcbnl_ieee_del                               328     338     +10
netlbl_cipsov4_add_common.isra               218     215      -3
...
ovs_nla_put_actions                          888     806     -82
netlbl_cipsov4_add_std                      1648    1566     -82
nl80211_parse_sched_scan                    2889    2780    -109
ip_tun_from_nlattr                          3086    2945    -141
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Fri, 2 Dec 2016 01:51:07 +0000 (09:51 +0800)]
staging: wilc1000: use reset to set mac header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
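The relationship between the two helpers can be sketched with a toy buffer descriptor. struct buf and both functions are hypothetical simplifications of the skb header-offset helpers, kept only to show that "set" with an offset of zero does the same as "reset" plus a redundant add.

```c
#include <assert.h>
#include <stddef.h>

/* Toy buffer descriptor; only the fields the sketch needs. */
struct buf {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* start of the current payload */
	ptrdiff_t mac_header;	/* header offset relative to head */
};

/* Point the header at the current data position. */
static void reset_mac_header(struct buf *b)
{
	assert(b->data >= b->head);
	b->mac_header = b->data - b->head;
}

/* "Set" is "reset" plus an add, which is wasted work when offset is 0. */
static void set_mac_header(struct buf *b, int offset)
{
	reset_mac_header(b);
	b->mac_header += offset;
}
```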
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Fri, 2 Dec 2016 01:51:06 +0000 (09:51 +0800)]
iwlwifi: use reset to set transport header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Fri, 2 Dec 2016 01:51:05 +0000 (09:51 +0800)]
mlx4: use reset to set mac header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Fri, 2 Dec 2016 01:51:04 +0000 (09:51 +0800)]
bnx2x: use reset to set network header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Fri, 2 Dec 2016 01:51:03 +0000 (09:51 +0800)]
qede: use reset to set network header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 3 Dec 2016 20:46:51 +0000 (15:46 -0500)]
Merge branch 'xgene-jumbo-and-pause-frame'
Iyappan Subramanian says:
====================
drivers: net: xgene: Add Jumbo and Pause frame support
This patch set adds,
1. Jumbo frame support
2. Pause frame based flow control
and fixes RSS for non-TCP/UDP packets.
====================
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:44 +0000 (16:41 -0800)]
drivers: net: xgene: ethtool: Add get/set_pauseparam
This patch adds get_pauseparam and set_pauseparam functions.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:43 +0000 (16:41 -0800)]
drivers: net: xgene: Add flow control initialization
This patch adds flow control/pause frame initialization and
advertising capabilities.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:42 +0000 (16:41 -0800)]
drivers: net: xgene: Add flow control configuration
This patch adds functions to configure mac, when flow control
and pause frame settings change.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:41 +0000 (16:41 -0800)]
drivers: net: xgene: fix: RSS for non-TCP/UDP
This patch fixes RSS feature, for non-TCP/UDP packets.
Signed-off-by: Khuong Dinh <kdinh@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:40 +0000 (16:41 -0800)]
drivers: net: xgene: Add change_mtu function
This patch implements ndo_change_mtu() callback function that
enables mtu change.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:39 +0000 (16:41 -0800)]
drivers: net: xgene: Add support for Jumbo frame
This patch adds support for jumbo frame, by allocating
additional buffer (page) pool and configuring the hardware.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:38 +0000 (16:41 -0800)]
drivers: net: xgene: Configure classifier with pagepool
This patch configures classifier with the pagepool information.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Fri, 2 Dec 2016 00:41:37 +0000 (16:41 -0800)]
drivers: net: xgene: Add helper function
This is a preparation patch and adds xgene_enet_get_fpsel() helper
function to get buffer pool number.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Gortmaker [Thu, 1 Dec 2016 20:25:28 +0000 (15:25 -0500)]
net: ethernet: ti: davinci_cpdma: add missing EXPORTs
As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d
("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the
ARM allmodconfig builds would fail modpost with:
ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
Since these weren't declared as static, it is assumed they were
meant to be shared outside the file, and that modular build testing
was simply overlooked.
Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 3 Dec 2016 20:26:30 +0000 (15:26 -0500)]
Merge tag 'linux-can-next-for-4.10-20161201' of git://git./linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2016-12-01
this is a pull request of 4 patches for net-next/master.
There are two patches by Chris Paterson for the rcar_can and rcar_canfd
device tree binding documentation. And a patch by Geert Uytterhoeven
that corrects the order of interrupt specifiers.
The fourth patch by Colin Ian King fixes a spelling error in the
kvaser_usb driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
LABBE Corentin [Thu, 1 Dec 2016 15:19:41 +0000 (16:19 +0100)]
net: stmmac: unify mdio functions
stmmac_mdio_{read|write} and stmmac_mdio_{read|write}_gmac4 are not
different enough to justify being split.
The only differences between those two functions are the shift/mask for
addr/reg/clk_csr.
This patch introduces a per-platform set of variables for setting those
shift/mask values and unifies the mdio read and write functions.
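A table-driven version of the address computation might look like the sketch below. The struct, the two layouts and their shift/mask values are illustrative assumptions, not the driver's actual register definitions; the point is only that one routine can serve both variants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-platform description of the MDIO address register. */
struct mii_layout {
	unsigned int addr_shift, reg_shift, clk_csr_shift;
	uint32_t addr_mask, reg_mask, clk_csr_mask;
};

/* Illustrative layouts only; real field positions come from the datasheet. */
static const struct mii_layout layout_a = {
	11, 6, 2, 0x0000f800, 0x000007c0, 0x0000003c
};
static const struct mii_layout layout_b = {
	21, 16, 8, 0x03e00000, 0x001f0000, 0x00000f00
};

/* One routine serves both variants; only the table entry differs. */
static uint32_t mii_address(const struct mii_layout *l, unsigned int phyaddr,
			    unsigned int phyreg, unsigned int clk_csr)
{
	assert(l != 0);
	return (((uint32_t)phyaddr << l->addr_shift) & l->addr_mask) |
	       (((uint32_t)phyreg << l->reg_shift) & l->reg_mask) |
	       (((uint32_t)clk_csr << l->clk_csr_shift) & l->clk_csr_mask);
}
```

Masking after the shift also discards values that do not fit the field, so an out-of-range argument cannot spill into a neighbouring field.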
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LABBE Corentin [Thu, 1 Dec 2016 15:19:40 +0000 (16:19 +0100)]
net: stmmac: avoid Camelcase naming
This patch simply renames regValue to value, as it is named in other
mdio functions.
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Mon, 28 Nov 2016 14:19:43 +0000 (15:19 +0100)]
irda: w83977af_ir: fix damaged whitespace
As David Miller pointed out for the previous patch, the whitespace
in some functions looks rather odd. This was caused by commit 6329da5f258a
("obsolete config in kernel source: USE_INTERNAL_TIMER"), which removed
some conditions but did not reindent the code.
This fixes the indentation in the file and removes extraneous whitespace
at the end of the lines and before tabs.
There are many other minor coding style problems in the driver, but I'm
not touching those here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pavel Machek [Thu, 1 Dec 2016 10:32:18 +0000 (11:32 +0100)]
stmmac: cleanup documenation, make it match reality
Fix English in documentation, make documentation match reality, remove
options that were removed from code.
Signed-off-by: Pavel Machek <pavel@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 3 Dec 2016 16:46:54 +0000 (11:46 -0500)]
Merge git://git./linux/kernel/git/davem/net
Couple conflicts resolved here:
1) In the MACB driver, a bug fix to properly initialize the
RX tail pointer overlapped with some changes
to support variable sized rings.
2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
overlapping with a reorganization of the driver to support
ACPI, OF, as well as PCI variants of the chip.
3) In 'net' we had several probe error path bug fixes to the
stmmac driver, meanwhile a lot of this code was cleaned up
and reorganized in 'net-next'.
4) The cls_flower classifier obtained a helper function in
'net-next' called __fl_delete() and this overlapped with
Daniel Borkmann's bug fix to use RCU for object destruction
in 'net'. It also overlapped with Jiri's change to guard
the rhashtable_remove_fast() call with a check against
tc_skip_sw().
5) In mlx4, a revert bug fix in 'net' overlapped with some
unrelated changes in 'net-next'.
6) In geneve, a stale header pointer after pskb_expand_head()
bug fix in 'net' overlapped with a large reorganization of
the same code in 'net-next'. Since the 'net-next' code no
longer had the bug in question, there was nothing to do
other than to simply take the 'net-next' hunks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Carolyn Wyborny [Tue, 8 Nov 2016 21:05:12 +0000 (13:05 -0800)]
i40e: change message to only appear when extra debug info is wanted
This patch changes an X722 informational message so that it only
appears when extra messages are desired. Without this patch,
on X722 devices, this message appears at load, potentially causing
unnecessary alarm.
Change-ID: I94f7aae15dc5b2723cc9728c630c72538a3e670e
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 8 Nov 2016 21:05:11 +0000 (13:05 -0800)]
i40e/i40evf: replace for memcpy with single memcpy call in ethtool
Replace the for loop of memcpy calls with a single memcpy call in ethtool.
Change-ID: I3f5bef6bcc593412c56592c6459784db41575a0a
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 8 Nov 2016 21:05:10 +0000 (13:05 -0800)]
i40e: set broadcast promiscuous mode for each active VLAN
A previous workaround added to ensure receipt of all broadcast frames
incorrectly set the broadcast promiscuous mode unconditionally
regardless of active VLAN status.
Replace this partial workaround with a complete solution that sets the
broadcast promiscuous filters in i40e_sync_vsi_filters. This new method
sets the promiscuous mode based on when broadcast filters are added or
removed.
I40E_VLAN_ANY will request a broadcast filter for all VLANs, (as we're
in untagged mode) while a broadcast filter on a specific VLAN will only
request broadcast for that VLAN.
Thus, we restore addition of broadcast filter to the array, but we add
special handling for these such that they enable the broadcast
promiscuous mode instead of being sent as regular filters.
The end result is that we will correctly receive all broadcast packets
(even those with a *source* address equal to the broadcast address) but
will not receive packets for which we don't have an active VLAN filter.
Change-ID: I7d0585c5cec1a5bf55bf533b42e5e817d5db6a2d
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Harshitha Ramamurthy [Tue, 8 Nov 2016 21:05:09 +0000 (13:05 -0800)]
i40e: Fix for ethtool Supported link modes
This patch fixes the problem where the ethtool Supported link
modes list backplane interfaces on X722 devices for 10GbE with
SFP+ and Cortina retimer. This patch fixes the problem by setting
and using a flag for this particular device since the backplane
interface is only between the internal PHY and the retimer and it
should not be seen by the user as they cannot use it.
Without this patch, the user wrongly thinks that backplane interfaces
are supported on their device when they actually are not.
Change-ID: I3882bc2928431d48a2db03a51a713a1f681a79e9
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 8 Nov 2016 21:05:08 +0000 (13:05 -0800)]
i40evf: protect against NULL msix_entries and q_vectors pointers
Update the functions which free msix_entries and q_vectors so that they
are safe against NULL values. This allows calling code to not care
whether these have already been freed when disabling and freeing them.
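A NULL-tolerant teardown helper could be sketched as follows. The struct and function names are hypothetical and the per-vector teardown is reduced to clearing an int; the pattern of returning early on NULL and clearing the pointer afterwards is what makes repeated calls harmless.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy adapter; field names are illustrative. */
struct adapter {
	int *msix_entries;
	int *q_vectors;
	int num_vectors;
};

/* Safe to call repeatedly: a second call finds NULL and returns early. */
static void free_msix_entries(struct adapter *a)
{
	assert(a != NULL);
	if (!a->msix_entries)
		return;
	free(a->msix_entries);
	a->msix_entries = NULL;
}

/* Per-vector teardown touches the array, so the NULL check must come first. */
static void free_q_vectors(struct adapter *a)
{
	assert(a != NULL);
	if (!a->q_vectors)
		return;
	for (int i = 0; i < a->num_vectors; i++)
		a->q_vectors[i] = 0;	/* stand-in for per-vector cleanup */
	free(a->q_vectors);
	a->q_vectors = NULL;
	a->num_vectors = 0;
}
```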
Change-ID: I31bfd1c0da18023d971b618edc6fb049721f3298
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Henry Tieman [Tue, 8 Nov 2016 21:05:07 +0000 (13:05 -0800)]
i40e: Pass unknown PHY type for unknown PHYs
The PHY type value for unrecognized PHYs and cables was changed
based on firmware version number. Newer hardware uses lower firmware
version numbers and this was causing some PHYs to be identified
as type 0x16 instead of 0xe (unknown).
Without this patch, newer cards will incorrectly identify unknown
PHYs and cables.
This change adds hardware type to the check for firmware version
so the PHY type is reported correctly.
Change-ID: I0723cbfd263c76fc73ff1a5275d1639051376c9a
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>