Florian Westphal [Mon, 17 Aug 2015 16:09:55 +0000 (18:09 +0200)]
netfilter: nft_payload: work around vlan header stripping
Make the payload expression aware of the fact that VLAN offload may have
removed a VLAN header.
When we encounter a tagged skb, transparently insert the tag into the
register so that VLAN header matching can work without userspace being
aware of offload features.
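A minimal user-space sketch of the idea, not the kernel code itself: when
a payload read touches the link-layer header of a frame whose tag was
stripped by the NIC, the 4-byte 802.1Q tag is spliced back in between the
source MAC and the EtherType while copying. The function name
copy_with_vlan() and its parameters are assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define ETH_ALEN   6
#define VLAN_HLEN  4   /* TPID (2 bytes) + TCI (2 bytes) */

/*
 * Copy `len` bytes starting at `offset` of the frame as it looked on the
 * wire, i.e. with the stripped 802.1Q tag put back between the source MAC
 * and the EtherType. `frame` is the untagged frame, `tpid` and `tci` are
 * the offloaded tag values in host byte order.
 * Returns 0 on success, -1 if the request runs past the tagged frame.
 */
static int copy_with_vlan(const uint8_t *frame, size_t frame_len,
                          uint16_t tpid, uint16_t tci,
                          uint8_t *dst, size_t offset, size_t len)
{
    const size_t tag_off = 2 * ETH_ALEN;      /* tag sits after both MACs */
    const uint8_t tag[VLAN_HLEN] = {
        (uint8_t)(tpid >> 8), (uint8_t)tpid,
        (uint8_t)(tci >> 8),  (uint8_t)tci,
    };

    if (frame_len < tag_off || offset + len > frame_len + VLAN_HLEN)
        return -1;

    for (size_t i = 0; i < len; i++) {
        size_t pos = offset + i;              /* position in the tagged frame */

        if (pos < tag_off)
            dst[i] = frame[pos];              /* MAC addresses */
        else if (pos < tag_off + VLAN_HLEN)
            dst[i] = tag[pos - tag_off];      /* re-inserted tag */
        else
            dst[i] = frame[pos - VLAN_HLEN];  /* rest, shifted by the tag */
    }
    return 0;
}
```

With this, a match on offset 12 sees the TPID and TCI even though the
frame in memory carries the EtherType there.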
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann [Fri, 14 Aug 2015 14:03:40 +0000 (16:03 +0200)]
netfilter: nf_conntrack: add efficient mark to zone mapping
This work adds the possibility of deriving the zone id from the skb->mark
field in a scalable manner. This allows for having only a single template
serving hundreds/thousands of different zones, for example, instead of the
need to have one match for each zone as an extra CT jump target.
Note that we'd need to have this information attached to the template as at
the time when we're trying to lookup a possible ct object, we already need
to know zone information for a possible match when going into
__nf_conntrack_find_get(). This work provides a minimal implementation for
a possible mapping.
In order to not add/expose an extra ct->status bit, the zone structure has
been extended to carry a flag for deriving the mark.
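As a rough sketch of the mapping, assuming a 16-bit zone id and a
hypothetical flag name (ZONE_FLAG_USE_MARK): a single template either
supplies its fixed id or tells the lookup path to take the id from the
packet mark.

```c
#include <stdint.h>

#define ZONE_FLAG_USE_MARK 0x1   /* hypothetical flag name */

struct zone {
    uint16_t id;
    uint8_t  flags;
};

/*
 * Resolve the zone for a packet: either the template's fixed id or, if
 * the template asks for it, an id derived from the packet mark.
 */
static struct zone zone_from_template(const struct zone *tmpl, uint32_t mark)
{
    struct zone z = *tmpl;

    if (tmpl->flags & ZONE_FLAG_USE_MARK)
        z.id = (uint16_t)mark;   /* the id is 16 bit wide here, the mark is truncated */
    return z;
}
```

One template with the flag set can therefore serve every zone, since the
id is only known once the mark of the packet at hand is available.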
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann [Fri, 14 Aug 2015 14:03:39 +0000 (16:03 +0200)]
netfilter: nf_conntrack: add direction support for zones
This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.
The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.
As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
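The direction-dependent id can be sketched as follows. This is an
illustrative model only: the struct layout, the ZONE_DIR_* names and the
helper names are assumptions, and the default zone id is taken to be 0.

```c
#include <stdint.h>
#include <stdbool.h>

enum ct_dir { DIR_ORIGINAL = 0, DIR_REPLY = 1 };

#define ZONE_DIR_ORIG   (1u << DIR_ORIGINAL)
#define ZONE_DIR_REPL   (1u << DIR_REPLY)
#define ZONE_DIR_BOTH   (ZONE_DIR_ORIG | ZONE_DIR_REPL)
#define DEFAULT_ZONE_ID 0       /* assumed value of the default zone */

struct zone {
    uint16_t id;
    uint8_t  dir;               /* bitmask of directions the id applies to */
};

/* The id only counts in the directions the zone was configured for. */
static uint16_t zone_id_for(const struct zone *z, enum ct_dir dir)
{
    return (z->dir & (1u << dir)) ? z->id : DEFAULT_ZONE_ID;
}

/* Equality checks compare the effective ids for the given direction. */
static bool zone_equal(const struct zone *a, const struct zone *b,
                       enum ct_dir dir)
{
    return zone_id_for(a, dir) == zone_id_for(b, dir);
}
```

Two zones restricted to the original direction differ there but compare
equal in the reply direction, where both fall back to the default zone.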
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.
Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.
If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).
Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.
Below is a minimal, simplified collision example (script in [2]) with
netperf sessions:
+--- tenant-1 ---+ mark := 1
|    netperf     |--+
+----------------+  |                CT zone := mark [ORIGINAL]
 [ip,sport] := X   +--------------+  +--- gateway ---+
                   | mark routing |--|     SNAT      |-- ... +
                   +--------------+  +---------------+       |
+--- tenant-2 ---+  |                                     ~~~|~~~
|    netperf     |--+                +-----------+           |
+----------------+ mark := 2         | netserver |------ ... +
 [ip,sport] := X                     +-----------+
                                      [ip,port] := Y
On the gateway netns, example:
iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully
iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark
conntrack dump from gateway netns:
netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns
tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
[ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
[ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
[ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
[ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2
Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.
I also ran various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.
[1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
[2] https://paste.fedoraproject.org/242835/65657871/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann [Sat, 8 Aug 2015 19:40:01 +0000 (21:40 +0200)]
netfilter: nf_conntrack: push zone object into functions
This patch replaces the zone id which is pushed down into functions
with the actual zone object. It's a bigger one-time change, but
needed for later on extending zones with a direction parameter, and
thus decoupling this additional information from all call-sites.
No functional changes in this patch.
The default zone becomes a global const object, namely nf_ct_zone_dflt
and will be returned directly in various cases, one being, when there's
f.e. no zoning support.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Andreas Schultz [Wed, 5 Aug 2015 15:51:45 +0000 (17:51 +0200)]
netfilter: nfacct: per network namespace support
- Move the nfnl_acct_list into the network namespace, initialize
and destroy it per namespace
- Keep track of refcnt on nfacct objects, the old logic no
longer works with a per namespace list
- Adjust xt_nfacct to pass the namespace when registering objects
Signed-off-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 5 Aug 2015 10:38:44 +0000 (12:38 +0200)]
netfilter: nft_limit: add per-byte limiting
This patch adds a new NFTA_LIMIT_TYPE netlink attribute to indicate the type of
limiting.
Contrary to per-packet limiting, the cost is calculated from the packet path
since this depends on the packet length.
The burst attribute indicates the number of bytes by which the rate can be
exceeded.
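To illustrate why the per-byte cost has to come from the packet path, here
is a small sketch. The enum names, function names and the bucket-size
formula are assumptions for illustration and are not taken from the patch.

```c
#include <stdint.h>

enum limit_type { LIMIT_PKTS, LIMIT_BYTES };   /* hypothetical names */

/*
 * Token cost of one packet. `unit_ns` is the length of the rate interval
 * in nanoseconds and `rate` is the number of packets or bytes allowed
 * within that interval.
 */
static uint64_t limit_cost(enum limit_type type, uint64_t unit_ns,
                           uint64_t rate, uint32_t pkt_len)
{
    if (type == LIMIT_PKTS)
        return unit_ns / rate;          /* constant, can be precomputed */
    return unit_ns * pkt_len / rate;    /* depends on each packet's length */
}

/* Assumed bucket size: one interval's worth of tokens plus the burst. */
static uint64_t limit_bucket_size(uint64_t unit_ns, uint64_t rate,
                                  uint64_t burst)
{
    return unit_ns * (rate + burst) / rate;
}
```

At 1000 bytes/second, a 500 byte packet costs half of the interval, while
the per-packet cost never changes once the rate is configured.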
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 2 Aug 2015 12:24:24 +0000 (14:24 +0200)]
netfilter: nft_limit: constant token cost per packet
The cost per packet can be calculated from the control plane path since this
doesn't ever change.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 2 Aug 2015 16:02:14 +0000 (18:02 +0200)]
netfilter: nft_limit: add burst parameter
This patch adds the burst parameter. This burst indicates the number of packets
that can exceed the limit.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 2 Aug 2015 12:16:42 +0000 (14:16 +0200)]
netfilter: nft_limit: factor out shared code with per-byte limiting
This patch prepares the introduction of per-byte limiting.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Fri, 31 Jul 2015 12:10:22 +0000 (14:10 +0200)]
netfilter: nft_limit: convert to token-based limiting at nanosecond granularity
Rework the limit expression to use a token-based limiting approach that refills
the bucket gradually. The tokens are calculated at nanosecond granularity
instead of jiffies to improve precision.
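A minimal token-bucket sketch at nanosecond granularity, assuming one
token per elapsed nanosecond and a bucket that holds one full interval.
The struct and function names are made up for illustration; the real
expression keeps its state in the nft_limit private data.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct limit {
    uint64_t tokens;      /* current credit, in nanoseconds */
    uint64_t tokens_max;  /* bucket size, in nanoseconds */
    uint64_t last;        /* timestamp of the previous call, in nanoseconds */
};

static void limit_init(struct limit *l, uint64_t unit_secs, uint64_t now)
{
    l->tokens_max = unit_secs * NSEC_PER_SEC;
    l->tokens = l->tokens_max;          /* start with a full bucket */
    l->last = now;
}

/* Returns 1 if the packet is within the limit, 0 if it exceeds it. */
static int limit_allow(struct limit *l, uint64_t now, uint64_t cost)
{
    /* Gradual refill: every elapsed nanosecond adds one token. */
    uint64_t tokens = l->tokens + (now - l->last);

    if (tokens > l->tokens_max)
        tokens = l->tokens_max;
    l->last = now;

    if (tokens < cost) {
        l->tokens = tokens;             /* keep the partial refill */
        return 0;
    }
    l->tokens = tokens - cost;
    return 1;
}
```

For a rate of 10 packets/second the cost per packet would be
NSEC_PER_SEC / 10, so a full bucket admits ten packets back to back and
then one more every 100 ms.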
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Fri, 31 Jul 2015 12:16:51 +0000 (14:16 +0200)]
netfilter: nft_limit: rename to nft_limit_pkts
To prepare introduction of bytes ratelimit support.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 31 May 2015 16:04:11 +0000 (18:04 +0200)]
netfilter: nf_tables: add nft_dup expression
This new expression uses the nf_dup engine to clone packets to a given gateway.
Unlike xt_TEE, we use an index to indicate output interface which should be
fine at this stage.
Moreover, change to the preemption-safe this_cpu_read(nf_skb_duplicated) from
nf_dup_ipv{4,6} to silence a lockdep splat.
Based on the original tee expression from Arturo Borrero Gonzalez, although
this patch has diverged quite a bit from this initial effort due to the
change to support maps.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 31 May 2015 15:54:44 +0000 (17:54 +0200)]
netfilter: factor out packet duplication for IPv4/IPv6
Extracted from the xtables TEE target. This creates two new modules for IPv4
and IPv6 that are shared between the TEE target and the new nf_tables dup
expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Wed, 1 Jul 2015 14:38:10 +0000 (16:38 +0200)]
netfilter: xt_TEE: get rid of WITH_CONNTRACK definition
Use IS_ENABLED(CONFIG_NF_CONNTRACK) instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 8 Jun 2015 12:42:40 +0000 (14:42 +0200)]
netfilter: nft_counter: convert it to use per-cpu counters
This patch converts the existing seqlock to per-cpu counters.
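The general per-cpu counter pattern looks roughly like this in user
space. It is a simplified model: NR_CPUS is fixed for illustration and
the 64-bit read consistency handling a kernel implementation needs on
32-bit machines is not modelled.

```c
#include <stdint.h>

#define NR_CPUS 4   /* illustrative; a real system queries this at runtime */

struct counter {
    uint64_t packets;
    uint64_t bytes;
};

static struct counter percpu[NR_CPUS];

/* Fast path: each CPU only touches its own slot, no shared lock needed. */
static void counter_add(unsigned int cpu, uint32_t pkt_len)
{
    percpu[cpu].packets++;
    percpu[cpu].bytes += pkt_len;
}

/* Slow path (e.g. a dump to user space): fold all slots into one total. */
static struct counter counter_read(void)
{
    struct counter total = { 0, 0 };

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        total.packets += percpu[cpu].packets;
        total.bytes += percpu[cpu].bytes;
    }
    return total;
}
```

The update cost moves from the packet path, where contention hurts, to
the comparatively rare read.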
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jason A. Donenfeld [Tue, 4 Aug 2015 16:26:19 +0000 (18:26 +0200)]
net_dbg_ratelimited: turn into no-op when !DEBUG
The pr_debug family of functions turns into a no-op when -DDEBUG is not
specified, opting instead to call "no_printk", which gets compiled to a
no-op (but retains gcc's nice warnings about printf-style arguments).
The problem with net_dbg_ratelimited is that it is defined to be a
variant of net_ratelimited_function, which expands to essentially:
if (net_ratelimit())
pr_debug(fmt, ...);
When DEBUG is not defined, then this becomes,
if (net_ratelimit())
;
This seems benign, except it isn't. Firstly, there's the obvious
overhead of calling net_ratelimit needlessly, which does quite some
bookkeeping for the rate limiting. Given that the pr_debug and
net_dbg_ratelimited family of functions are sprinkled liberally through
performance critical code, with developers assuming they'll be compiled
out to a no-op most of the time, we certainly do not want this needless
bookkeeping. Secondly, and most visibly, even though no debug message
is printed when DEBUG is not defined, if there is a flood of
invocations, dmesg winds up peppered with messages such as
"net_ratelimit: 320 callbacks suppressed". This is because our
aforementioned net_ratelimit() function actually prints this text in
some circumstances. It's especially odd to see this when there isn't any
other accompanying debug message.
So, in sum, it doesn't make sense to have this function's current
behavior, and instead it should match what every other debug family of
functions in the kernel does with !DEBUG -- nothing.
This patch replaces calls to net_dbg_ratelimited when !DEBUG with
no_printk, keeping with the idiom of all the other debug print helpers.
Also, though not strictly necessary, it guards the call with an if (0)
so that evaluation of any arguments is sure to be compiled out.
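The idiom can be tried out in user space with a sketch like the one
below. The names dbg_ratelimited() and ratelimit_ok() are stand-ins
invented for this example, and the format attribute is a GCC/Clang
extension.

```c
#include <stdio.h>

/*
 * Stand-in for the kernel's no_printk(): never prints, but lets the
 * compiler type-check the format string against its arguments.
 */
__attribute__((format(printf, 1, 2)))
static inline int no_printk(const char *fmt, ...)
{
    (void)fmt;
    return 0;
}

int ratelimit_calls;            /* how often the limiter was consulted */

/* Hypothetical rate limiter; always grants permission in this sketch. */
int ratelimit_ok(void)
{
    ratelimit_calls++;
    return 1;
}

#ifdef DEBUG
#define dbg_ratelimited(...) \
    do { if (ratelimit_ok()) fprintf(stderr, __VA_ARGS__); } while (0)
#else
/* if (0) turns the call, including argument evaluation, into dead code. */
#define dbg_ratelimited(...) \
    do { if (0) no_printk(__VA_ARGS__); } while (0)
#endif
```

Built without -DDEBUG, a call such as dbg_ratelimited("%d\n", expensive())
neither consults the limiter nor evaluates expensive(), yet a mismatched
format string still produces a compiler warning.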
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 4 Aug 2015 13:36:24 +0000 (06:36 -0700)]
af_mpls: add null dev check in find_outdev
This patch adds null dev check for the 'cfg->rc_via_table ==
NEIGH_LINK_TABLE or dev_get_by_index() failed' case
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Aug 2015 05:02:32 +0000 (22:02 -0700)]
Merge branch 'test-bpf-next'
Nicolas Schichan says:
====================
test_bpf improvements
Please find below the patch series with my latest changes to test_bpf.
The first patch checks for unexpected NULL generated skbs before
running the filter.
The second patch adds the possibility for tests to generate fragmented
skbs.
The third patch tests LD_ABS and LD_IND on fragmented skbs.
The fourth patch adds the possibility to restrict the tests being run
by specifying the name/id/range of the test(s) to run via module
parameters.
The fifth patch tests LD_ABS and LD_IND on non fragmented skbs with
various sizes and alignments.
The sixth and final patch checks that the interpreter or JIT correctly
resets A and X to 0.
This series is against today's net-next tree.
Changes in V2:
* move declaration of 'ptr' in if() block in patch 2/6.
* fix various typos in patch 4/6
* rework default init of test_range array and cleanup exclude_test()
return condition in patch 4/6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 4 Aug 2015 13:19:12 +0000 (15:19 +0200)]
test_bpf: add tests checking that JIT/interpreter sets A and X to 0.
It is mandatory for the JIT or interpreter to reset the A and X
registers to 0 before running the filter. Check that it is the case on
various ALU and JMP instructions.
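What such a test relies on can be shown with a toy interpreter. The
opcodes below are invented for this sketch and are not real BPF
encodings; the point is only that A and X start at 0, so a filter that
reads them before writing them still gives a defined result.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy opcodes, not real BPF encodings. */
enum op { OP_LD_IMM, OP_ADD_X, OP_TAX, OP_RET_A };

struct insn {
    enum op  op;
    uint32_t k;
};

static uint32_t run_filter(const struct insn *prog, size_t len)
{
    uint32_t A = 0, X = 0;  /* the mandatory reset before the first insn */

    for (size_t pc = 0; pc < len; pc++) {
        switch (prog[pc].op) {
        case OP_LD_IMM: A = prog[pc].k; break;
        case OP_ADD_X:  A += X;         break;
        case OP_TAX:    X = A;          break;
        case OP_RET_A:  return A;
        }
    }
    return 0;               /* fell off the end: treated as "drop" here */
}
```

A one-instruction program consisting of only OP_RET_A must return 0; an
implementation that leaves stale register contents in A would fail that
check.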
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 4 Aug 2015 13:19:11 +0000 (15:19 +0200)]
test_bpf: add more tests for LD_ABS and LD_IND.
This exercises the LD_ABS and LD_IND instructions for various sizes and
alignments. This also checks that X when used as an offset to a
BPF_IND instruction first in a filter is correctly set to 0.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 4 Aug 2015 13:19:10 +0000 (15:19 +0200)]
test_bpf: add module parameters to filter the tests to run.
When developing on the interpreter or a particular JIT, it can be
interesting to restrict the tests list to a specific test or a
particular range of tests.
This patch adds the following module parameters to the test_bpf module:
* test_name=<string>: only the specified named test will be run.
* test_id=<number>: only the test with the specified id will be run
(see the output of test_bpf without parameters to get the test id).
* test_range=<number>,<number>: only the tests within IDs in the
specified id range are run (see the output of test_bpf without
parameters to get the test ids).
Any invalid range, test id or test name will result in -EINVAL being
returned and no tests being run.
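The range handling could look something like the sketch below. Only the
exclude_test() name appears in the series; its signature here and the
check_test_range() helper are assumptions.

```c
#include <errno.h>

/* Validate a [first, last] range against `ntests` available tests. */
static int check_test_range(int first, int last, int ntests)
{
    if (first < 0 || last < 0 ||
        first >= ntests || last >= ntests || first > last)
        return -EINVAL;     /* invalid input: run nothing at all */
    return 0;
}

/* A test is skipped when its index falls outside the selected range. */
static int exclude_test(int idx, int first, int last)
{
    return idx < first || idx > last;
}
```

A single test id is then just the degenerate range where first and last
are equal.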
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 4 Aug 2015 13:19:09 +0000 (15:19 +0200)]
test_bpf: test LD_ABS and LD_IND instructions on fragmented skbs.
These new tests exercise various load sizes and offsets crossing the
head/fragment boundary.
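A load that crosses the head/fragment boundary has to gather its bytes
from two separate buffers. The following user-space sketch models that
with one linear part and one fragment; struct frag_buf and load_be32()
are illustrative names, not test_bpf code.

```c
#include <stdint.h>
#include <stddef.h>

struct frag_buf {
    const uint8_t *head;    /* linear part */
    size_t         head_len;
    const uint8_t *frag;    /* one paged fragment following the head */
    size_t         frag_len;
};

/*
 * Load a big-endian 32-bit word at `off`. Returns 0 on success and
 * -1 if the word does not fit in head + fragment.
 */
static int load_be32(const struct frag_buf *b, size_t off, uint32_t *out)
{
    size_t total = b->head_len + b->frag_len;
    uint32_t v = 0;

    if (off > total || total - off < 4)
        return -1;

    for (size_t i = 0; i < 4; i++) {
        size_t pos = off + i;
        /* Pick the byte from whichever buffer covers this position. */
        uint8_t byte = pos < b->head_len ? b->head[pos]
                                         : b->frag[pos - b->head_len];
        v = (v << 8) | byte;
    }
    *out = v;
    return 0;
}
```

With a 2-byte head and a 3-byte fragment, offsets 0 and 1 both straddle
the boundary, while offset 2 runs past the end and must fail.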
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 4 Aug 2015 13:19:08 +0000 (15:19 +0200)]
test_bpf: allow tests to specify an skb fragment.
This introduces a new test->aux flag (FLAG_SKB_FRAG) to tell the
populate_skb() function to add a fragment to the test skb containing
the data specified in test->frag_data.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 4 Aug 2015 13:19:07 +0000 (15:19 +0200)]
test_bpf: avoid oopsing the kernel when generate_test_data() fails.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Aug 2015 05:00:59 +0000 (22:00 -0700)]
Merge branch 'mlx5e-next'
Amir Vadai says:
====================
net/mlx5e: Driver updates 04-Aug-2015
This patchset introduces two features to the ConnectX-4 driver: Patch 8/8
("Support physical port counters") exposes some hardware counters through
ethtool. Rest of the patches are preparation and usage of what we call
light-weight netdev open/close. Some flows that used to be in the ndo_open/stop
are moved to the PCI probe/remove flows - i.e. we will make the netdev
open/close operations more "light-weight".
The benefits of this change are:
1) Reduce the execution time of the stop/open operations.
2) Avoid saving SW shadows of resource configurations that must
persist through stop/open operations (e.g flow table steering
rules), and avoid deleting/applying them from/to the device upon
netdev stop/open.
3) Avoid synchronizing threads that access those resources with the
netdev stop/open threads.
Instead of creating/destroying the resources during netdev open/stop, this patchset
changes the behavior such that upon netdev stop, traffic is redirected to a
"Drop RQ" (an RQ that silently drops all incoming traffic at the NIC HW level).
After redirecting the traffic, RX/TX software resources could be destroyed.
During netdev open, the RX/TX rings are created and traffic is redirected to
the RX rings.
Patchset was applied and tested over commit
ba7591d ("ebpf: add skb->hash to
offset map for usage in {cls, act}_bpf or filters")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Tue, 4 Aug 2015 11:05:47 +0000 (14:05 +0300)]
net/mlx5_core: Support physical port counters
Added physical port counters in the following standard formats to
ethtool statistics:
- IEEE 802.3
- RFC2863
- RFC2819
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:46 +0000 (14:05 +0300)]
net/mlx5e: Take advantage of the light-weight netdev open/stop
Now that TIRs, TISs and flow tables are kept alive while the netdev is
stopped (after executing ndo_stop()) we can do the following
improvements:
- Obsolete the active_vlans SW shadow.
- Do not delete/add flow table rules upon ndo_stop/open.
In addition to simplifying the flow, this change also speeds up
the ndo_open/close operations.
- Obsolete synchronization of threads accessing the flow tables
with the netdev stop/open threads.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:45 +0000 (14:05 +0300)]
net/mlx5e: Disable async events before unregister_netdev()
It does not make sense to allow events while the netdev is
unregistered.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:44 +0000 (14:05 +0300)]
net/mlx5e: Rename/move functions following the ndo_stop flow change
Rename some functions that used to be invoked upon ndo_open/stop and
are now invoked upon create/destroy_netdev() in order to better hint
their place in the flow.
Change some functions location in the file so that functions involved
in ndo_open/stop flow will not be interleaved with other functions.
This is a cosmetic change, no logical change here.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:43 +0000 (14:05 +0300)]
net/mlx5e: Light-weight netdev open/stop
Create/destroy TIRs, TISs and flow tables upon PCI probe/remove rather
than upon the netdev ndo_open/stop.
Upon ndo_stop(), redirect all RX traffic to the (lately introduced)
"Drop RQ" and then close only the RX/TX rings, leaving the TIRs,
TISs and flow tables alive.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:42 +0000 (14:05 +0300)]
net/mlx5_core: Introduce access function to modify RSS/LRO params
To be used by the mlx5 Eth driver in following commit.
This is in preparation for netdev "light-weight" open/stop flow
change described in previous commit.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:41 +0000 (14:05 +0300)]
net/mlx5e: Introduce the "Drop RQ"
RX traffic routed to this RQ will be silently dropped, at the NIC HW
level.
This is in preparation for netdev "light-weight" open/stop flow
change described in previous commit.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Tue, 4 Aug 2015 11:05:40 +0000 (14:05 +0300)]
net/mlx5e: Unify the RX flow
Generally an RX packet flows through the following objects:
Flow table --> TIR --> RQT --> RQ
Where:
- TIR stands for "Transport Interface Receive", defining the RSS and
LRO parameters.
- RQT stands for "RQ Table", implementing the RSS indirection table.
- RQ stands for "Receive Queue"
For flows that do not need LRO, nor RSS, the driver made a shortcut to
the above RX flow by pointing to the RQ directly from the TIR, yielding
this flow:
Flow table --> TIR --> RQ
In this commit we remove this shortcut by "inserting" a single-RQ RQT
between the TIR and the RQ, i.e RX packets will reach the same RQ but
will go through an RQT of size 1, pointing to just a single RQ.
This way the RX traffic re-direction to/from the "Drop RQ" will be more
uniform (AKA "one flow"), as it will involve only RQTs re-direction and
no TIRs re-direction.
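The uniform redirection can be pictured with a small software model,
assuming an RQT is simply an array of RQ numbers. The struct and
function names below are hypothetical and say nothing about the actual
firmware commands involved.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_RQT_SIZE 8              /* illustrative bound */

struct rqt {
    size_t   size;                  /* 1 for flows without RSS */
    uint32_t rqn[MAX_RQT_SIZE];     /* RQ numbers the table points at */
};

/* Netdev stop: point every entry at the Drop RQ. */
static void rqt_redirect_to_drop(struct rqt *t, uint32_t drop_rqn)
{
    for (size_t i = 0; i < t->size; i++)
        t->rqn[i] = drop_rqn;
}

/* Netdev open: spread the entries over the freshly created RX rings. */
static void rqt_redirect_to_rings(struct rqt *t, const uint32_t *rings,
                                  size_t nrings)
{
    for (size_t i = 0; i < t->size; i++)
        t->rqn[i] = rings[i % nrings];
}
```

Because the single-RQ case is just a table of size 1, the same two
functions cover both RSS and non-RSS flows and no TIR has to be touched.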
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Aug 2015 04:59:27 +0000 (21:59 -0700)]
Merge branch 'cpsw-next'
Mugunthan V N says:
====================
CPSW interrupt handling cleanup and performance improvement
This patch series removes the disabling of interrupts at the irq controller
and adds a napi for tx event handling, which improves the performance by
~180Mbps on dra7-evm
[ 5] local 192.168.10.116 port 5001 connected with 192.168.10.165 port 44176
[ 5] 0.0-60.0 sec 1.48 GBytes 210 Mbits/sec
[ 4] local 192.168.10.116 port 5001 connected with 192.168.10.165 port 33257
[ 4] 0.0-60.0 sec 2.71 GBytes 386 Mbits/sec
Changes from initial version:
* Added a patch to have napi only for the first interface, as there is
no use in having separate napis for each interface: the
interrupt is shared by both interfaces and only one napi is
scheduled for each interrupt.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Tue, 4 Aug 2015 10:36:20 +0000 (16:06 +0530)]
drivers: net: cpsw: add separate napi for tx
Instead of processing tx events in the isr, add a separate napi for
tx, which improves performance by ~180Mbps with
omap2plus_defconfig on the DRA74x platform. Also clean up the rx napis
by renaming them to napi_rx to make the code easier to understand.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Tue, 4 Aug 2015 10:36:19 +0000 (16:06 +0530)]
drivers: net: cpsw: dual_emac: simplify napi usage
Since the interrupt is shared between the two ethernet interfaces and
the isr schedules only one napi at a time, having two
napis doesn't make any difference. So make the napi a
common resource for the dual ethernet interfaces.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Tue, 4 Aug 2015 10:36:18 +0000 (16:06 +0530)]
drivers: net: cpsw: remove disable_irq/enable_irq as irq can be masked from cpsw itself
CPSW interrupts can be disabled by masking them in the CPSW itself and
cleared by writing the appropriate EOI. So remove all
disable_irq/enable_irq calls, as discussed in [1]
[1] http://patchwork.ozlabs.org/patch/492741/
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Tue, 4 Aug 2015 07:44:22 +0000 (10:44 +0300)]
mpls: small cleanup in inet/inet6_fib_lookup_dev()
We recently changed this code from returning NULL to returning ERR_PTR.
There are some left over NULL assignments which we can remove. We can
preserve the error code from ip_route_output() instead of always
returning -ENODEV. Also these functions use a mix of gotos and direct
returns. There is no cleanup necessary so I changed the gotos to
direct returns.
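For readers less familiar with the ERR_PTR convention, here is a
user-space re-creation of the helpers together with a hypothetical
lookup that forwards the error it was given instead of always reporting
-ENODEV. lookup_dev() and struct device are invented for this sketch;
`route_err` simulates the result of the routing step.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space re-creations of the kernel helpers of the same names. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    /* Error codes live in the topmost MAX_ERRNO addresses. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct device {
    int ifindex;
};

/* Hypothetical lookup with direct returns and a preserved error code. */
static struct device *lookup_dev(struct device *candidate, int route_err)
{
    if (route_err)
        return ERR_PTR(route_err);  /* pass the original error through */
    if (!candidate)
        return ERR_PTR(-ENODEV);
    return candidate;
}
```

Callers check the result with IS_ERR() and recover the code with
PTR_ERR(), so a NULL return and a separate error variable are no longer
needed.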
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Aug 2015 04:54:13 +0000 (21:54 -0700)]
Merge branch 'bnx2x-cnic-bnx2fc-bd-support'
Yuval Mintz says:
====================
bnx2x, cnic, bnx2fc: add support for BD
Commit
230d00eb4bfe ("bnx2x: new Multi-function mode - BD") added support
for a new multi-function mode, but it added only the support required by
bnx2x for L2 interfaces.
This adds the required changes to support the new multi-function mode in
the offloaded storage protocols.
Dave,
Please consider applying this series to `net-next'.
Do notice that this involves non-networking driver changes -
but sending this as a single series seemed like the best approach as
we had to have bnx2x changes to support the new functionality.
If this is problematic, please tell us what's the preferred solution here.
Changes from previous versions
------------------------------
- From v1 - no actual changes; v1 failed to reach netdev so in order to
keep things in line I've termed this one v2.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Carnuccio [Tue, 4 Aug 2015 06:37:30 +0000 (09:37 +0300)]
bnx2fc: Read npiv table from nvram and create vports.
Signed-off-by: Joe Carnuccio <joe.carnuccio@qlogic.com>
Signed-off-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Tue, 4 Aug 2015 06:37:29 +0000 (09:37 +0300)]
bnx2x: Add BD support for storage
Commit
230d00eb4bfe ("bnx2x: new Multi-function mode - BD") adds support
for the new mode in bnx2x. This expands this support by implementing
APIs required by our storage drivers to support that mode.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adheer Chandravanshi [Tue, 4 Aug 2015 06:37:28 +0000 (09:37 +0300)]
cnic: Add the interfaces to get FC-NPIV table.
Signed-off-by: Adheer Chandravanshi <adheer.chandravanshi@qlogic.com>
Signed-off-by: Chad Dupuis <chad.dupuis@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tej Parkash [Tue, 4 Aug 2015 06:37:27 +0000 (09:37 +0300)]
cnic: Populate upper layer driver state in MFW
Signed-off-by: Tej Parkash <tej.parkash@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Tue, 4 Aug 2015 05:31:18 +0000 (22:31 -0700)]
rocker: use netdev_err after register_netdev
After successful register_netdev, we can use netdev_err rather than the more
generic dev_err.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Tue, 4 Aug 2015 05:31:17 +0000 (22:31 -0700)]
rocker: NULL port if port probe fails
Set port to NULL if port probe fails so we don't try to remove partially
initialized port on port probe err cleanup path.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 5 Aug 2015 06:57:45 +0000 (23:57 -0700)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) A couple of cleanups for the netfilter core hook from Eric Biederman.
2) Net namespace hook registration, also from Eric. This adds a dependency with
the rtnl_lock. This should be fine by now but we have to keep an eye on this
because if we ever get the per-subsys nfnl_lock before rtnl we may have
problems in the future. But we have room to remove this in the future by
propagating the complexity to the clients, by registering hooks for the init
netns functions.
3) Update nf_tables to use the new net namespace hook infrastructure, also from
Eric.
4) Three patches to refine and to address problems from the new net namespace
hook infrastructure.
5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
only applies to a very special case, the TEE target, but Eric Dumazet
reports that this is slowing down things for everyone else. So let's only
switch to the alternate jumpstack if the tee target is in use through a
static key. This batch also comes with offline precalculation of the
jumpstack based on the callchain depth. From Florian Westphal.
6) Minimal SCTP multihoming support for our conntrack helper, from Michal
Kubecek.
7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
Westphal.
8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.
9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Thu, 30 Jul 2015 16:53:45 +0000 (16:53 +0000)]
netfilter: ip6t_REJECT: Remove debug messages from reject_tg6()
Make it similar to reject_tg() in ipt_REJECT.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Tue, 4 Aug 2015 08:24:35 +0000 (01:24 -0700)]
Merge branch 'cxgb4-next'
Hariprasad Shenai says:
====================
add meminfo, bist status and misc. fixes
This patch series adds the following.
Add support to dump memory address range of various hw modules
Add support to dump edc bist status during ecc error
Read correct bits of who am i register for T6 adapter
and update T6 register range
This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
V2: PATCH 3/4 ("cxgb4/cxgb4vf: read the correct bits of PL Who Am I
register") Fix switch statement in get_chip_type() and some more style
fixes based on review comment by Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Tue, 4 Aug 2015 09:06:20 +0000 (14:36 +0530)]
cxgb4: Update T6 register ranges
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Tue, 4 Aug 2015 09:06:19 +0000 (14:36 +0530)]
cxgb4/cxgb4vf: read the correct bits of PL Who Am I register
Read the correct bits of PL Who Am I for the Source PF field which has
changed in T6
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Tue, 4 Aug 2015 09:06:18 +0000 (14:36 +0530)]
cxgb4: Add support to dump edc bist status
Add support to dump edc bist status for ECC data errors
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Tue, 4 Aug 2015 09:06:17 +0000 (14:36 +0530)]
cxgb4: Add debugfs support to dump meminfo
Add debug support to dump memory address ranges of various hardware
modules of the adapter.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Shearman [Mon, 3 Aug 2015 16:50:04 +0000 (17:50 +0100)]
mpls: Use definition for reserved label checks
In multiple locations there are checks for whether the label in hand
is a reserved label or not using the arbitrary value of 16. Factor
this out into a #define for better maintainability and for
documentation.
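A minimal userspace sketch of the idea, not the patch itself. The macro name
MPLS_LABEL_FIRST_UNRESERVED and the helper mpls_label_is_reserved() are
assumptions made for illustration; the message above only states that the
literal 16 is factored out into a #define.

```c
#include <stdbool.h>
#include <stdint.h>

/* Labels 0..15 are reserved; 16 is the first label free for general use.
 * The macro name is an assumption, not taken from the commit message. */
#define MPLS_LABEL_FIRST_UNRESERVED 16

/* Hypothetical helper: true when the label falls in the reserved range. */
static bool mpls_label_is_reserved(uint32_t label)
{
	return label < MPLS_LABEL_FIRST_UNRESERVED;
}
```

Each open-coded comparison against 16 can then call the helper or compare
against the macro, so the meaning of the constant is visible at every call site.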
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 4 Aug 2015 05:26:14 +0000 (22:26 -0700)]
Merge branch 'lwtunnel-encap-local'
Robert Shearman says:
====================
lwtunnel: encap locally-generated ipv4 packets
Locally-generated IPv4 packets, such as from applications running on
the host or traceroute/ping, currently don't have lwtunnel output
redirected encap applied. However, they should, in the same way as
forwarded packets, and this patch series addresses that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Shearman [Mon, 3 Aug 2015 16:39:21 +0000 (17:39 +0100)]
ipv4: apply lwtunnel encap for locally-generated packets
lwtunnel encap is applied for forwarded packets, but not for
locally-generated packets. This is because the output function is not
overridden in __mkroute_output, unlike it is in __mkroute_input.
The lwtunnel state is correctly set on the rth through the call to
rt_set_nexthop, so all that needs to be done is to override the dst
output function to be lwtunnel_output if there is lwtunnel state
present and it requires output redirection.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Shearman [Mon, 3 Aug 2015 16:39:20 +0000 (17:39 +0100)]
lwtunnel: set skb protocol and dev
In the locally-generated packet path skb->protocol may not be set and
this is required for the lwtunnel encap in order to get the lwtstate.
This would otherwise have been set by ip_output or ip6_output so set
skb->protocol prior to calling the lwtunnel encap
function. Additionally set skb->dev in case it is needed further down
the transmit path.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 3 Aug 2015 23:19:58 +0000 (01:19 +0200)]
bridge: mdb: fix vlan_enabled access when vlans are not configured
Instead of trying to access br->vlan_enabled directly use the provided
helper br_vlan_enabled().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 3 Aug 2015 14:21:57 +0000 (16:21 +0200)]
act_bpf: properly support late binding of bpf action to a classifier
Since the introduction of the BPF action in d23b8ad8ab23 ("tc: add BPF
based action"), late binding was not working as expected. I.e. setting
the action part for a classifier only via 'bpf index <num>', where <num>
is the index of an existing action, is being rejected by the kernel due
to other missing parameters.
It doesn't make sense to require these parameters such as BPF opcodes
etc, as they are not going to be used anyway: in this case, they're just
allocated/parsed and then freed again w/o doing anything meaningful.
Instead, parse and verify the remaining parameters *after* the test on
tcf_hash_check(), when we really know that we're dealing with creation
of a new action or replacement of an existing one and where late binding
is thus irrelevant.
After patch, test case is now working:
FOO="1,6 0 0 4294967295,"
tc actions add action bpf bytecode "$FOO"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action bpf index 1
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 2 bind 1
tc filter show dev foo
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 flowid 1:1 bytecode '1,6 0 0 4294967295'
action order 1: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 2 bind 1
Late binding of a BPF action can be useful for preloading maps (e.g. before
they hit traffic) in case of eBPF programs, or to share a single eBPF action
with multiple classifiers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 3 Aug 2015 13:17:44 +0000 (09:17 -0400)]
net: dsa: mv88e6xxx: call _mv88e6xxx_stats_wait with SMI lock held
At switch setup, _mv88e6xxx_stats_wait was called without holding the
SMI mutex. Fix this by requesting the lock for this call.
Also, return the _mv88e6xxx_stats_wait code, since it may fail.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Satish Ashok [Mon, 3 Aug 2015 11:29:16 +0000 (13:29 +0200)]
bridge: mdb: add/del entry on all vlans if vlan_filter is enabled and vid is 0
Before this patch, when a vid was not specified, the entry was added with
vid 0, which is useless when vlan_filtering is enabled. With this patch,
an entry whose vid is 0 is added on all configured vlans when vlan
filtering is enabled, and correspondingly deleted from all of them.
This is also closer to the way fdb works with regard to vid 0 and vlan
filtering.
Example:
Setup:
$ bridge vlan add vid 256 dev eth4
$ bridge vlan add vid 1024 dev eth4
$ bridge vlan add vid 64 dev eth3
$ bridge vlan add vid 128 dev eth3
$ bridge vlan
port    vlan ids
eth3     1 PVID Egress Untagged
         64
         128
eth4     1 PVID Egress Untagged
         256
         1024
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering
Before:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp
After:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp vid 1
dev br0 port eth3 grp 239.0.0.1 temp vid 128
dev br0 port eth3 grp 239.0.0.1 temp vid 64
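The behaviour shown in the example can be modelled in a few lines of userspace
C. This is only a sketch under stated assumptions: struct mdb_table,
mdb_add_one() and mdb_add() are hypothetical names, and the real bridge code
keeps far more state per entry than a bare vid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

/* Hypothetical stand-in for the mdb: only the vid of each entry is kept. */
struct mdb_table {
	uint16_t vids[MAX_ENTRIES];
	size_t count;
};

/* Append a single entry; fails when the table is full. */
static bool mdb_add_one(struct mdb_table *t, uint16_t vid)
{
	if (t->count >= MAX_ENTRIES)
		return false;
	t->vids[t->count++] = vid;
	return true;
}

/* With vlan filtering on and vid 0, add one entry per vlan configured on
 * the port; otherwise add exactly the entry that was requested. */
static bool mdb_add(struct mdb_table *t, bool vlan_filtering,
		    const uint16_t *port_vids, size_t n_vids, uint16_t vid)
{
	if (vlan_filtering && vid == 0) {
		for (size_t i = 0; i < n_vids; i++)
			if (!mdb_add_one(t, port_vids[i]))
				return false;
		return true;
	}
	return mdb_add_one(t, vid);
}
```

For eth3 in the setup above (vlans 1, 64 and 128), a request with vid 0 yields
three entries in this model, matching the "After" listing.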
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 3 Aug 2015 21:24:50 +0000 (14:24 -0700)]
Merge branch 'stacked-vlan-TSO'
Toshiaki Makita says:
====================
Stacked vlan TSO for virtual devices
Basically virtual devices do not need to segment double tagged packets.
This patch set adds TSO feature for double tagged packets to several
virtual devices, which can be realized by simply setting
.ndo_features_check to passthru_features_check.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Fri, 31 Jul 2015 06:03:27 +0000 (15:03 +0900)]
tuntap: Don't segment multiple tagged packets on tap device
Tap devices don't need to segment multiple tagged packets.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Fri, 31 Jul 2015 06:03:26 +0000 (15:03 +0900)]
bridge: Don't segment multiple tagged packets on bridge device
Bridge devices don't need to segment multiple tagged packets since their
ports can segment them.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Fri, 31 Jul 2015 06:03:25 +0000 (15:03 +0900)]
veth: Don't segment multiple tagged packets on veth device
Veth devices don't need to segment multiple tagged packets.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Fri, 31 Jul 2015 06:03:24 +0000 (15:03 +0900)]
macvlan: Don't segment multiple tagged packets on macvlan device
Macvlan/macvtap devices don't need to segment multiple tagged packets
since the lower devices can segment them.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 31 Jul 2015 16:25:17 +0000 (18:25 +0200)]
virtio_net: add gro capability
Straightforward patch to add GRO processing to virtio_net.
napi_complete_done() usage allows more aggressive aggregation,
opted-in by setting /sys/class/net/xxx/gro_flush_timeout
Tested:
Setting /sys/class/net/xxx/gro_flush_timeout to 1000 nsec,
Rick Jones reported following results.
One VM of each on a pair of OpenStack compute nodes with E5-2650Lv3 CPUs
and Intel 82599ES-based NICs. So, two "before" and two "after" VMs.
The OpenStack compute nodes were running OpenStack Kilo, with VxLAN
encapsulation being used through OVS so no GRO coming-up the host
stack. The compute nodes themselves were running a 3.14-based kernel.
Single-stream netperf, CPU utilizations and thus service demands are
based on intra-guest reported CPU.
Throughput Mbit/s, bigger is better
                     Min     Median  Average  Max
4.2.0-rc3+           1364    1686    1678     1938
4.2.0-rc3+flush1k    1824    2269    2275     2647
Send Service Demand, smaller is better
                     Min     Median  Average  Max
4.2.0-rc3+           0.236   0.558   0.524    0.802
4.2.0-rc3+flush1k    0.176   0.503   0.471    0.738
Receive Service Demand, smaller is better.
                     Min     Median  Average  Max
4.2.0-rc3+           1.906   2.188   2.191    2.531
4.2.0-rc3+flush1k    0.448   0.529   0.533    0.692
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Rick Jones <rick.jones2@hp.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sun, 2 Aug 2015 18:56:38 +0000 (20:56 +0200)]
rocker: linearize skb in case frags would not fit into tx descriptor
Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 2 Aug 2015 18:56:37 +0000 (20:56 +0200)]
rocker: enable support for scattered packets
rocker supports the transmission of scattered packets, so let the kernel
know about it by setting the NETIF_F_SG bit in the device's features.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 31 Jul 2015 22:46:29 +0000 (00:46 +0200)]
ebpf: add skb->hash to offset map for usage in {cls, act}_bpf or filters
Add skb->hash to the __sk_buff offset map, so it can be accessed from
an eBPF program. We currently already do this for classic BPF filters,
but not yet on eBPF, it might be useful as a demuxer in combination with
helpers like bpf_clone_redirect(), toy example:
  __section("cls-lb") int ingress_main(struct __sk_buff *skb)
  {
    unsigned int which = 3 + (skb->hash & 7);
    /* bpf_skb_store_bytes(skb, ...); */
    /* bpf_l{3,4}_csum_replace(skb, ...); */
    bpf_clone_redirect(skb, which, 0);
    return -1;
  }
I was thinking whether to add skb_get_hash(), but then concluded the
raw skb->hash seems fine in this case: we can directly access the hash
w/o extra eBPF helper function call, it's filled out by many NICs on
ingress, and in case the entropy level would not be sufficient, people
can still implement their own specific sw fallback hash mix anyway.
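The demux arithmetic in the toy example can be checked in isolation. The
sketch below is plain userspace C, not eBPF; pick_target() is a hypothetical
helper that mirrors the `3 + (skb->hash & 7)` expression.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the toy example: the low three bits of the
 * hash select one of eight targets, offset so the result lands in 3..10. */
static unsigned int pick_target(uint32_t hash)
{
	return 3 + (hash & 7);
}
```

Because only the low three bits are used, the spread across the eight targets
depends on how well those bits are distributed in the hash supplied by the NIC.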
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Sun, 2 Aug 2015 09:42:41 +0000 (12:42 +0300)]
bnx2x: Correct logic for pvid configuration.
Commit 05cc5a39ddb7 ("bnx2x: add vlan filtering offload") introduced
incorrect logic for checking whether pvid should be configured for
a vf, causing the hypervisor driver to send unneeded ramrods for all of
the vfs each time a pvid has changed.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 1 Aug 2015 06:52:20 +0000 (23:52 -0700)]
Merge git://git./linux/kernel/git/davem/net
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 1 Aug 2015 00:10:56 +0000 (17:10 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Must teardown SR-IOV before unregistering netdev in igb driver, from
Alex Williamson.
2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell.
3) Default route selection in ipv4 should take the prefix length, table
ID, and TOS into account, from Julian Anastasov.
4) sch_plug must have a reset method in order to purge all buffered
packets when the qdisc is reset, likewise for sch_choke, from WANG
Cong.
5) Fix deadlock and races in slave_changelink/br_setport in bridging.
From Nikolay Aleksandrov.
6) mlx4 bug fixes (wrong index in port event propagation to VFs,
overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack
Morgenstein, and Or Gerlitz.
7) Turn off klog message about SCTP userspace interface compat that
makes no sense at all, from Daniel Borkmann.
8) Fix unbounded restarts of inet frag eviction process, causing NMI
watchdog soft lockup messages, from Florian Westphal.
9) Suspend/resume fixes for r8152 from Hayes Wang.
10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from
Sabrina Dubroca.
11) Fix performance regression when removing a lot of routes from the
ipv4 routing tables, from Alexander Duyck.
12) Fix device leak in AF_PACKET, from Lars Westerhoff.
13) AF_PACKET also has a header length comparison bug due to signedness,
from Alexander Drozdov.
14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann.
15) Memory leaks, TSO stats, watchdog timeout and other fixes to
thunderx driver from Sunil Goutham and Thanneeru Srinivasulu.
16) act_bpf can leak memory when replacing programs, from Daniel
Borkmann.
17) WOL packet fixes in gianfar driver, from Claudiu Manoil.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
stmmac: fix missing MODULE_LICENSE in stmmac_platform
gianfar: Enable device wakeup when appropriate
gianfar: Fix suspend/resume for wol magic packet
gianfar: Fix warning when CONFIG_PM off
act_pedit: check binding before calling tcf_hash_release()
net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
net: sched: fix refcount imbalance in actions
r8152: reset device when tx timeout
r8152: add pre_reset and post_reset
qlcnic: Fix corruption while copying
act_bpf: fix memory leaks when replacing bpf programs
net: thunderx: Fix for crash while BGX teardown
net: thunderx: Add PCI driver shutdown routine
net: thunderx: Fix crash when changing rss with mutliple traffic flows
net: thunderx: Set watchdog timeout value
net: thunderx: Wakeup TXQ only if CQE_TX are processed
net: thunderx: Suppress alloc_pages() failure warnings
net: thunderx: Fix TSO packet statistic
net: thunderx: Fix memory leak when changing queue count
net: thunderx: Fix RQ_DROP miscalculation
...
David S. Miller [Sat, 1 Aug 2015 00:07:12 +0000 (17:07 -0700)]
Merge branch 'ipv6-auto-flow-labels'
Tom Herbert says:
====================
ipv6: Turn on auto IPv6 flow labels by default
BSD (MacOS) has already turned on flow labels by default and this does
not seem to be causing any problems in the Internet. Let's go ahead
and turn them on by default. We'll continue to monitor for any devices
that start choking on them.
Flow labels are important since they are the desired solution for
network devices to perform ECMP and RSS (RFC6437 and RFC6438).
Traditionally, devices perform a 5-tuple hash on packets that
includes port numbers. For the most part, these devices can only
compute 5-tuple hashes for TCP and UDP. This severely limits our ability
to get good network load balancing for other protocols (IPIP, GRE, ESP,
etc.), and hence we are limited in using other protocols. Unfortunately,
this method is accepted as the de facto standard to the extent that
there are several proposals to encapsulate protocols in UDP _just_ for
the purposes for getting ECMP to work. With hosts generating flow labels
and devices taking them as input into ECMP (several already do), we can
start to fix this fundamental problem.
This patch set:
- Changes IPV6_FLOWINFO sockopt to be opt-out of flow labels for
connections rather than opt-in
- Disable flow label state ranges sysctl by default
- Enable auto flow labels sysctl by default
v2:
- Added functions to create an skb->hash based on flowi4 and flowi6.
These are called in output path when creating a packet
- Call skb_get_hash_flowi6 in ip6_make_flowlabel
- Implement the auto_flowlabels sysctl as a mode for auto flowlabels.
There are four modes which correspond to flow labels being enabled
and whether socket option can be used to opt in or opt out of
using them
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 31 Jul 2015 23:52:14 +0000 (16:52 -0700)]
ipv6: Enable auto flow labels by default
Initialize auto_flowlabels to one. This enables automatic flow labels;
individual sockets may disable them using the IPV6_AUTOFLOWLABEL socket
option.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 31 Jul 2015 23:52:13 +0000 (16:52 -0700)]
ipv6: Disable flowlabel state ranges by default
Per RFC6437 stateful flow labels (e.g. labels set by flow label manager)
cannot "disturb" nodes taking part in stateless flow labels. While the
ranges only reduce the flow label entropy by one bit, it is conceivable
that this might bias the algorithm on some routers causing a load
imbalance. For best results on the Internet we really need the full
20 bits.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 31 Jul 2015 23:52:12 +0000 (16:52 -0700)]
ipv6: Implement different admin modes for automatic flow labels
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:
0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets
np->autoflowlabel is initialized according to the sysctl value.
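The four modes can be summarised as a small decision function. This is a
userspace sketch only: the enum constants and both helper names are
hypothetical, and the per-socket flag stands in for np->autoflowlabel.

```c
#include <stdbool.h>

/* Hypothetical names for the four sysctl values listed above. */
enum auto_flowlabel_mode {
	FLOWLABEL_OFF    = 0,	/* disabled */
	FLOWLABEL_OPTOUT = 1,	/* enabled, sockets can opt out */
	FLOWLABEL_OPTIN  = 2,	/* allowed, sockets can opt in */
	FLOWLABEL_FORCED = 3,	/* enabled and enforced */
};

/* Initial per-socket setting derived from the sysctl mode. */
static bool socket_default_autoflowlabel(enum auto_flowlabel_mode mode)
{
	return mode == FLOWLABEL_OPTOUT || mode == FLOWLABEL_FORCED;
}

/* Whether a label is generated, given the mode and the socket's setting. */
static bool use_auto_flowlabel(enum auto_flowlabel_mode mode, bool sock_setting)
{
	switch (mode) {
	case FLOWLABEL_OFF:
		return false;		/* socket setting is ignored */
	case FLOWLABEL_FORCED:
		return true;		/* socket cannot opt out */
	case FLOWLABEL_OPTOUT:
	case FLOWLABEL_OPTIN:
		return sock_setting;	/* socket has the final say */
	}
	return false;
}
```

Modes 1 and 2 differ only in the socket's initial setting; modes 0 and 3
override whatever the socket requests.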
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 31 Jul 2015 23:52:11 +0000 (16:52 -0700)]
ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel
We can't call skb_get_hash here since the packet is not yet complete
enough for the flow dissector. Create the hash based on flowi6 instead.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 31 Jul 2015 23:52:10 +0000 (16:52 -0700)]
net: Add functions to get skb->hash based on flow structures
Add skb_get_hash_flowi6 and skb_get_hash_flowi4 which derive an sk_buff
hash from flowi6 and flowi4 structures respectively. These functions
can be called when creating a packet in the output path where the new
sk_buff does not yet contain a fully formed packet that is parsable by
flow dissector.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 1 Aug 2015 00:05:37 +0000 (17:05 -0700)]
Merge branch 'for-linus-4.2' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix quick exhaustion of the system array in the superblock
btrfs: its btrfs_err() instead of btrfs_error()
btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
Linus Torvalds [Sat, 1 Aug 2015 00:00:25 +0000 (17:00 -0700)]
Merge tag 'sound-4.2-rc5' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became a relative big update as it includes the collected ASoC
fixes. There are a few fixes in ASoC core side, mostly for DAPM and
the new topology API. The rest are various ASoC driver-specific
fixes, as well as the usual HD-audio and USB-audio quirks"
* tag 'sound-4.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (29 commits)
ALSA: hda - Fix MacBook Pro 5,2 quirk
ALSA: hda - Fix race between PM ops and HDA init/probe
ALSA: usb-audio: add dB range mapping for some devices
ALSA: hda - Apply a fixup to Dell Vostro 5480
ALSA: hda - Add pin quirk for the headset mic jack detection on Dell laptop
ALSA: hda - Apply fixup for another Toshiba Satellite S50D
ALSA: fireworks: add support for AudioFire2 quirk
ALSA: hda - Fix the headset mic that will not work on Dell desktop machine
ALSA: hda - fix cs4210_spdif_automute()
ASoC: pcm1681: Fix setting de-emphasis sampling rate selection
ASoC: ssm4567: Keep TDM_BCLKS in ssm4567_set_dai_fmt
ASoC: sgtl5000: Fix up define for SGTL5000_SMALL_POP
ASoC: dapm: Don't add prefix to widget stream name
ASoC: rt5645: Check if codec is initialized in workqueue handler
ASoC: Intel: Get correct usage_count value to load firmware
ASoC: topology: Fix to add dapm mixer info
ASoC: zx: spdif: Fix devm_ioremap_resource return value check
ASoC: zx: i2s: Fix devm_ioremap_resource return value check
ASoC: mediatek: Use platform_of_node for machine drivers
ASoC: Free card DAPM context on snd_soc_instantiate_card() error path
...
David S. Miller [Fri, 31 Jul 2015 22:45:37 +0000 (15:45 -0700)]
Merge branch 'dsa-netconsole'
Florian Fainelli says:
====================
net: GENET, SYSTEMPORT and DSA netconsole
This patch series adds support for netconsole in the GENET, SYSTEMPORT and DSA
drivers.
A small refactoring to the DSA transmit path is required to avoid duplicating
the dsa_netpoll_send_skb() into each and every tagging protocol supported.
Testing on e.g: mv643xx_eth and/or e1000e would be much appreciated!
Changes in v2:
- properly disable/enable interrupts in GENET and SYSTEMPORT
- pass the reallocated SKB back to dsa_slave_xmit() in case a tag protocol had to
alter the original SKB
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 31 Jul 2015 18:42:57 +0000 (11:42 -0700)]
net: dsa: Add netconsole support
Add support for using DSA slave network devices with netconsole, which
requires us to allocate and free custom netpoll instances and invoke the
parent network device poll controller callback.
In order for netconsole to work, we need to construct the DSA tag, but
not queue the skb for transmission on the master network device xmit
function.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 31 Jul 2015 18:42:56 +0000 (11:42 -0700)]
net: dsa: Refactor transmit path to eliminate duplication
All tagging protocols do the same thing: increment device statistics,
make room for the tag to be inserted, create the tag, invoke the parent
network device transmit function.
In order to prepare for adding netpoll support, which requires the tag
creation, but not using the parent network device transmit function, do
some little refactoring which eliminates duplication between the 4
tagging protocols supported.
We need to return a sk_buff pointer back to the caller because the tag
specific transmit function may have to reallocate the original skb (e.g:
tag_trailer.c) and this is the one we should be transmitting, not the
original sk_buff we were passed.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 31 Jul 2015 18:42:55 +0000 (11:42 -0700)]
net: systemport: Add netconsole support
Implement a poll controller for netconsole which invokes the RX
interrupt handler to poll for incoming packets, and cleans up all TX
queues by invoking the TX interrupt handler.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 31 Jul 2015 18:42:54 +0000 (11:42 -0700)]
net: bcmgenet: Add netconsole support
Implement a poll controller for netconsole which invokes both of our
interrupt handlers for the different RX/TX queues.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Fri, 31 Jul 2015 17:13:22 +0000 (19:13 +0200)]
stmmac: fix missing MODULE_LICENSE in stmmac_platform
Commit 50649ab14982 ("stmmac: drop driver from stmmac platform code")
was a bit overzealous in removing code and dropped the MODULE_*
macros that are still needed since stmmac_platform can be a module.
Fix this by putting the macros removed in 50649ab14982 back.
This fixes the following errors when used as a module:
stmmac_platform: module license 'unspecified' taints kernel.
Disabling lock debugging due to kernel taint
stmmac_platform: Unknown symbol devm_kmalloc (err 0)
stmmac_platform: Unknown symbol stmmac_suspend (err 0)
stmmac_platform: Unknown symbol platform_get_irq_byname (err 0)
stmmac_platform: Unknown symbol stmmac_dvr_remove (err 0)
stmmac_platform: Unknown symbol platform_get_resource (err 0)
stmmac_platform: Unknown symbol of_get_phy_mode (err 0)
stmmac_platform: Unknown symbol of_property_read_u32_array (err 0)
stmmac_platform: Unknown symbol of_alias_get_id (err 0)
stmmac_platform: Unknown symbol stmmac_resume (err 0)
stmmac_platform: Unknown symbol stmmac_dvr_probe (err 0)
Fixes: 50649ab14982 ("stmmac: drop driver from stmmac platform code")
Reported-by: Igor Gnatenko <i.gnatenko.brain@gmail.com>
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 31 Jul 2015 22:41:50 +0000 (15:41 -0700)]
Merge branch 'gianfar-wol-fixes'
Claudiu Manoil says:
====================
gianfar: wol magic packet fixes
These changes were already validated as part of FSL SDK.
Patch 2 fixes occasional wake-on magic packet failures during
traffic, probably due to incorrect traffic stop/ device halt
sequence and incorrect usage of txlock.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Fri, 31 Jul 2015 15:38:33 +0000 (18:38 +0300)]
gianfar: Enable device wakeup when appropriate
The wol_en flag is 0 by default anyway, and we have the
following inconsistency: a MAGIC packet wol capable eth
interface is registered as a wake-up source but unable
to wake-up the system as wol_en is 0 (wake-on flag set to 'd').
Calling set_wakeup_enable() at netdev open is just redundant
because wol_en is 0 by default.
Let only ethtool call set_wakeup_enable() for now.
The bflock is obviously obsoleted, its utility has been corroded
over time. The bitfield flags used today in gianfar are accessed
only on the init/ config path, with no real possibility of
concurrency - nothing that would justify something like bflock.
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Fri, 31 Jul 2015 15:38:32 +0000 (18:38 +0300)]
gianfar: Fix suspend/resume for wol magic packet
If we disable NAPI in the first place we can mask the device's
interrupts (and halt it) without fearing that imask may be
concurrently accessed from interrupt context, so there's
no need to do local_irq_save() around gfar_halt_nodisable().
lock_rx_qs()/unlock_tx_qs() are just obsoleted and potentially
buggy routines. The txlock is currently used in the driver only
to manage TX congestion, it has nothing to do with halting the
device. With these changes, the TX processing is stopped before
gfar_halt().
Compact gfar_halt() is used instead of gfar_halt_nodisable(),
as it disables Rx/TX DMA h/w blocks and the Rx/TX h/w queues.
gfar_start() re-enables all these blocks on resume. Enabling
the magic-packet mode remains the same, note that the RX block
is re-enabled just before entering sleep mode.
Add IRQF_NO_SUSPEND flag for the error interrupt line, to signal
that the interrupt line must remain active during sleep in order
to wake the system by magic packet (MAG) reception interrupt.
(On some systems the MAG interrupt did trigger w/o this flag
as well, but on others it didn't.)
Without these fixes, when suspended during fair Tx traffic the
interface occasionally failed to be woken up by magic packet.
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Fri, 31 Jul 2015 15:38:31 +0000 (18:38 +0300)]
gianfar: Fix warning when CONFIG_PM off
CC drivers/net/ethernet/freescale/gianfar.o
drivers/net/ethernet/freescale/gianfar.c:568:13: warning: 'lock_tx_qs'
defined but not used [-Wunused-function]
static void lock_tx_qs(struct gfar_private *priv)
^
drivers/net/ethernet/freescale/gianfar.c:576:13: warning: 'unlock_tx_qs'
defined but not used [-Wunused-function]
static void unlock_tx_qs(struct gfar_private *priv)
^
Reported-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Fri, 31 Jul 2015 14:49:43 +0000 (16:49 +0200)]
bonding: add tlb_dynamic_lb netlink support
tlb_dynamic_lb could be set only via sysfs, this patch allows it to be
set via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 31 Jul 2015 22:33:23 +0000 (15:33 -0700)]
Merge tag 'wireless-drivers-next-for-davem-2015-07-31' of git://git./linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
Major changes:
mwifiex:
* add TX DATA Pause support
* add multichannel and TDLS channel switch support
ath10k:
* enable VHT for IBSS
* initial work to support qca99x0 and the corresponding 10.4 firmware branch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Fri, 31 Jul 2015 10:15:22 +0000 (11:15 +0100)]
sfc: MC allocations must be restored following an entity reset
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Fri, 31 Jul 2015 10:14:54 +0000 (11:14 +0100)]
sfc: allow ethtool selftest and MC reboot to complete on an unprivileged function
The policy in the net driver is to attempt MCDI commands and
then handle any EPERM error codes appropriately when returned
by unprivileged functions.
The ethtool selftest contains some tests which are useful on
an unprivileged function, such as the event queue interrupt
tests, but other tests cannot be performed as the function
does not have the required permissions.
If a test returns -EPERM, act as though the test was not run
and continue.
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
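The -EPERM policy described in the sfc selftest commit above can be
illustrated with a minimal, user-space sketch. This is not the sfc driver
code: struct selftest, run_selftests() and the three callbacks are
hypothetical names invented for the illustration. The only behaviour taken
from the commit message is that a test returning -EPERM is treated as not
run and the remaining tests continue.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical test descriptor; not the sfc driver's real structure. */
struct selftest {
	const char *name;
	int (*run)(void);	/* returns 0 on success or a negative errno */
	int result;		/* 1 = passed, 0 = not run, -1 = failed */
};

/*
 * Run every test in the array. A test that returns -EPERM is recorded
 * as "not run" and does not fail the overall selftest. Any other error
 * is recorded as a failure; the first such error code is returned.
 */
static int run_selftests(struct selftest *tests, size_t count)
{
	int first_err = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		int rc = tests[i].run();

		if (rc == -EPERM) {
			tests[i].result = 0;	/* act as though it was not run */
			continue;
		}
		tests[i].result = rc ? -1 : 1;
		if (rc && !first_err)
			first_err = rc;
	}
	return first_err;
}

/* Example callbacks standing in for real hardware tests (assumptions). */
static int test_ok(void)      { return 0; }
static int test_no_perm(void) { return -EPERM; }
static int test_broken(void)  { return -EIO; }
```

Recording "not run" separately from "failed" keeps the overall result
clean on an unprivileged function while still reporting genuine failures.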
Shaohui Xie [Fri, 31 Jul 2015 08:58:42 +0000 (16:58 +0800)]
net: phy: add driver for aquantia phy
This patch adds a driver to support the Aquantia PHYs AQ1202, AQ2104,
AQR105 and AQR405, which are accessed through clause 45.
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Fri, 31 Jul 2015 06:54:28 +0000 (23:54 -0700)]
br2684: Remove unnecessary formatting macros b1 and bs
Use vsprintf extension %pI4 instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 31 Jul 2015 03:23:39 +0000 (11:23 +0800)]
r8152: disable the capability of zero length
The UEFI driver enables zero-length transfers, which the Linux driver
doesn't need. With zero length enabled, the hw completes the transfer
with length 0 when there is no received packet. This adds load on the
USB host controller and reduces performance.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Fri, 31 Jul 2015 03:10:22 +0000 (20:10 -0700)]
vxlan: expose COLLECT_METADATA flag to user space
Two vxlan driver flags, FLOWBASED and COLLECT_METADATA, need to be set to
make use of its new flow mode. The former is already exposed. Expose the latter.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Fri, 31 Jul 2015 00:12:21 +0000 (17:12 -0700)]
act_pedit: check binding before calling tcf_hash_release()
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().
Fixes: 1a29321ed045 ("net_sched: act: Dont increment refcnt on replace")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 31 Jul 2015 22:21:30 +0000 (15:21 -0700)]
Merge branch 'mpls-build-fix'
Roopa Prabhu says:
====================
af_mpls: fix undefined reference to ip6_route_output with CONFIG_IPV6=n
This patch series uses ipv6_stub_impl.ipv6_dst_lookup instead of
ip6_route_output. It follows the vxlan driver's usage of
ipv6_stub_impl.ipv6_dst_lookup.
There is no sk in the af_mpls context from where
ipv6_stub_impl.ipv6_dst_lookup is used. sk appears to be needed
to get the namespace 'net' and is optional otherwise. This patch series
changes ipv6_stub_impl.ipv6_dst_lookup to take net argument. sk remains
optional.
v1 - v2: use IS_BUILTIN
v2 - v3: Use new Kconfig option that depends on (IPV6 || IPV6=n) as
suggested by Dave. Also uses IS_ERR as suggested by Thomas.
v3 - v4: Include missed case of (MPLS_ROUTING=y && IPV6=m) reported by
Dave.
v4 - v5: Use ipv6_stub_impl.ipv6_dst_lookup as suggested by Hannes
v5 - v6: protect against a null ipv6_stub by statically declaring
an ipv6_dst_lookup NOP func
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
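The v5 - v6 note in the mpls-build-fix cover letter (guarding against a
null stub with a statically declared NOP lookup) follows a common pattern
that can be sketched in plain C. Everything below is a simplified
stand-in, not the kernel's definitions: struct ipv6_stub_ops, the
two-argument lookup signature, ipv6_register(), mpls_find_route() and the
-EAFNOSUPPORT return value are assumptions made for the illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct net;
struct dst_entry { int id; };

struct ipv6_stub_ops {
	int (*ipv6_dst_lookup)(struct net *net, struct dst_entry **dst);
};

/* No-op fallback used while no IPv6 implementation is registered. */
static int nop_ipv6_dst_lookup(struct net *net, struct dst_entry **dst)
{
	(void)net;
	(void)dst;
	return -EAFNOSUPPORT;	/* error code chosen for this sketch */
}

static const struct ipv6_stub_ops nop_stub = {
	.ipv6_dst_lookup = nop_ipv6_dst_lookup,
};

/* The stub pointer starts at the no-op table, so it is never NULL. */
static const struct ipv6_stub_ops *ipv6_stub = &nop_stub;

/* A pretend "real" implementation, installed when IPv6 becomes available. */
static struct dst_entry fake_route = { 42 };

static int real_ipv6_dst_lookup(struct net *net, struct dst_entry **dst)
{
	(void)net;
	*dst = &fake_route;
	return 0;
}

static const struct ipv6_stub_ops real_stub = {
	.ipv6_dst_lookup = real_ipv6_dst_lookup,
};

static void ipv6_register(void)
{
	ipv6_stub = &real_stub;
}

/* The caller needs no NULL check: it always goes through the stub. */
static int mpls_find_route(struct net *net, struct dst_entry **dst)
{
	return ipv6_stub->ipv6_dst_lookup(net, dst);
}
```

With this layout a caller gets a well-defined error when IPv6 is absent
and never dereferences a null function table.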