Patrick McHardy [Mon, 14 Oct 2013 09:00:02 +0000 (11:00 +0200)]
netfilter: add nftables
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.
In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.
nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).
This patch includes the following components:
* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c
It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.
This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:
From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps
From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release
From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation
From Florian Westphal:
* nft_log: group is u16, snaplen u32
From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 14 Oct 2013 08:57:04 +0000 (10:57 +0200)]
netfilter: nf_nat: move alloc_null_binding to nf_nat_core.c
Similar to nat_decode_session, alloc_null_binding is needed for both
ip_tables and nf_tables, so move it to nf_nat_core.c. This change
is required by nf_tables.
This is an adapted version of the original patch from Patrick McHardy.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 10 Oct 2013 07:21:55 +0000 (09:21 +0200)]
netfilter: pass hook ops to hookfn
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Eric Dumazet [Thu, 10 Oct 2013 15:43:00 +0000 (08:43 -0700)]
tcp: tcp_transmit_skb() optimizations
1) We need to take a timestamp only for skb that should be cloned.
Other skbs are not in write queue and no rtt estimation is done on them.
2) the unlikely() hint is wrong for receivers (they send pure ACK)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: MF Nowlan <fitz@cs.yale.edu>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 11 Oct 2013 20:28:55 +0000 (16:28 -0400)]
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Included changes:
- update emails for A. Quartulli and M. Lindner in MAINTAINERS
- switch to the next on-the-wire protocol version
- introduce the T(ype) V(ersion) L(ength) V(alue) framework
- adjust the existing components to make them use the new TVLV code
- make the TT component use CRC32 instead of CRC16
- totally remove the VIS functionality (has been moved to userspace)
- reorder packet types and flags
- add static checks on packet format
- remove __packed from batadv_ogm_packet
David S. Miller [Thu, 10 Oct 2013 19:29:44 +0000 (15:29 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
This series contains updates to i40e only.
Alex provides the majority of the patches against i40e, where he does
cleanup of the Tx and RX queues and to align the code with the known
good Tx/Rx queue code in the ixgbe driver.
Anjali provides an i40e patch to update link events to not print to
the log until the device is administratively up.
Catherine provides a patch to update the driver version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 10 Oct 2013 07:04:37 +0000 (00:04 -0700)]
inet: rename ir_loc_port to ir_num
In commit
634fb979e8f ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :
skc_dport is __be16 (network order), but skc_num is __u16 (host order)
So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16
Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Catherine Sullivan [Sat, 28 Sep 2013 06:00:07 +0000 (06:00 +0000)]
i40e: Bump version
Update the version number of the driver.
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:01:03 +0000 (06:01 +0000)]
i40e: Add support for 64 bit netstats
This change brings support for 64 bit netstats to the driver. Previously
the stats were 64 bit but highly racy due to the fact that 64 bit
transactions are not atomic on 32 bit systems. This change makes is so
that the 64 bit byte and packet stats are reliable on all architectures.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:58 +0000 (06:00 +0000)]
i40e: Move rings from pointer to array to array of pointers
Allocate the queue pairs individually instead of as a group. This
allows for much easier queue management as it is possible to dynamically
resize the queues without having to free and allocate the entire block.
Ease statistic collection by treating Tx/Rx queue pairs as a single
unit. Each pair is allocated together and starts with a Tx queue and
ends with an Rx queue. By ordering them this way it is possible to know
the Rx offset based on a pointer to the Tx queue.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:53 +0000 (06:00 +0000)]
i40e: Replace ring container array with linked list
This replaces the ring container array with a linked list. The idea is
to make the logic much easier to deal with since this will allow us to
call a simple helper function from the q_vectors to go through the
entire list.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 07:01:44 +0000 (07:01 +0000)]
i40e: Move q_vectors from pointer to array to array of pointers
Allocate the q_vectors individually. The advantage to this is that it
allows for easier freeing and allocation. In addition it makes it so
that we could do node specific allocations at some point in the future.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:43 +0000 (06:00 +0000)]
i40e: Split bytes and packets from Rx/Tx stats
This makes it so that the Tx and Rx byte and packet counts are
separated from the rest of the statistics. This allows for better
isolation of these stats when we move them into the 64 bit statistics.
Simplify things by re-ordering how the stats display in ethtool.
Instead of displaying all of the Tx queues as a block, followed by all
the Rx queues, the new order is Tx[0], Rx[0], Tx[1], Rx[1], ..., Tx[n],
Rx[n]. This reduces the loops and cleans up the display for testing
purposes since it is very easy to verify if flow director is doing the
right thing as the Tx and Rx queue pair are shown in pairs.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:37 +0000 (06:00 +0000)]
i40e: Add support for Tx byte queue limits
Implement BQL (byte queue limit) support in i40e.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:32 +0000 (06:00 +0000)]
i40e: Drop dead code and flags from Tx hotpath
Drop Tx flag and TXSW which is tested but never set.
As a result of this change we can drop a complicated check that always
resulted in the final result of i40e_tx_csum being equal to the
CHECKSUM_PARTIAL value. As such we can replace the entire function call
with just a check for skb->summed == CHECKSUM_PARTIAL.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:27 +0000 (06:00 +0000)]
i40e: clean up Tx fast path
Sync the fast path for i40e_tx_map and i40e_clean_tx_irq so that they
are similar to igb and ixgbe.
- Only update the Tx descriptor ring in tx_map
- Make skb mapping always on the first buffer in the chain
- Drop the use of MAPPED_AS_PAGE Tx flag
- Only store flags on the first buffer_info structure
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:22 +0000 (06:00 +0000)]
i40e: Do not directly increment Tx next_to_use
Avoid directly incrementing next_to_use for multiple reasons. The main
reason being that if we directly increment it then it can attain a state
where it is equal to the ring count. Technically this is a state it
should not be able to reach but the way this is written it now can.
This patch pulls the value off into a register and then increments it
and writes back either the value or 0 depending on if the value is equal
to the ring count.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:17 +0000 (06:00 +0000)]
i40e: Cleanup Tx buffer info layout
- drop the mapped_as_page u8 from the Tx buffer info as it was unused
- use the DMA unmap accessors for Tx DMA
- replace checks of DMA with checks of the unmap length to verify if an
unmap is needed
- update the Tx buffer layout to make it consistent with igb, ixgbe
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Eric Dumazet [Thu, 10 Oct 2013 00:14:52 +0000 (17:14 -0700)]
tcp: use ACCESS_ONCE() in tcp_update_pacing_rate()
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 9 Oct 2013 22:21:29 +0000 (15:21 -0700)]
inet: includes a sock_common in request_sock
TCP listener refactoring, part 5 :
We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.
This patch includes the needed struct sock_common in front
of struct request_sock
This means there is no more inet6_request_sock IPv6 specific
structure.
Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.
loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 8 Oct 2013 16:02:23 +0000 (09:02 -0700)]
net: gro: allow to build full sized skb
skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.
It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.
This still builds very efficient skbs, and allows reaching 45 MSS per
skb.
(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)
High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)
Forwarding setups could be hurt, as such skbs will need to be
linearized, although its not a new problem, as GRO could already
provide skbs with a frag_list.
We could make the 65536 bytes threshold a tunable to mitigate this.
(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
baker.zhang [Tue, 8 Oct 2013 03:36:51 +0000 (11:36 +0800)]
fib_trie: only calc for the un-first node
This is a enhancement.
for the first node in fib_trie, newpos is 0, bit is 1.
Only for the leaf or node with unmatched key need calc pos.
Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gao feng [Fri, 4 Oct 2013 08:52:24 +0000 (16:52 +0800)]
veth: allow to setup multicast address for veth device
We can only setup multicast address for network device when
net_device_ops->ndo_set_rx_mode is not null.
Some configurations need to add multicast address for net
device, such as netfilter cluster match module.
Add a fake ndo_set_rx_mode function to allow this operation.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Sat, 28 Sep 2013 06:00:12 +0000 (06:00 +0000)]
i40e: Drop unused completed stat
The Tx "completed" stat was part of the original rewrite for detecting
Tx hangs. However some time ago in ixgbe I determined that we could
just use the packets stat instead. Since then this stat was
removed from ixgbe and it serves no purpose in i40e so it can be
dropped.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai [Sat, 28 Sep 2013 06:00:02 +0000 (06:00 +0000)]
i40e: Link code updates
Link events should not print to the log until the device is
administratively up.
Signed-off-by: Anjali Singhai <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Ajit Khaparde [Thu, 3 Oct 2013 21:17:06 +0000 (16:17 -0500)]
be2net: change the driver version number to 4.9.224.0
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ajit Khaparde [Thu, 3 Oct 2013 21:16:50 +0000 (16:16 -0500)]
be2net: Display RoCE specific counters in ethtool -S
SkyHawk-R can support RoCE. Add code to display RoCE specific
counters maintained in hardware.
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ajit Khaparde [Thu, 3 Oct 2013 21:16:33 +0000 (16:16 -0500)]
be2net: Call version 2 of GET_STATS ioctl for Skyhawk-R
Moving to version 2 of GET_STATS command as SkyHawk-R supports
higher number of rings.
Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Wunderlich [Thu, 25 Apr 2013 08:37:25 +0000 (10:37 +0200)]
batman-adv: reorder batadv_iv_flags
The vis flag is not needed anymore, and since we do a compat bump we
can start with the first bit again
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Simon Wunderlich [Thu, 25 Apr 2013 08:37:24 +0000 (10:37 +0200)]
batman-adv: remove packed from batadv_ogm_packet
As we decreased the struct size from 26 to 24 byte, we can remove
__packed as the compiler will not add any more padding.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Simon Wunderlich [Thu, 25 Apr 2013 08:37:23 +0000 (10:37 +0200)]
batman-adv: reorder packet types
Reordering the packet type numbers allows us to handle unicast
packets in a general way - even if we don't know the specific packet
type, we can still forward it. There was already code handling
this for a couple of unicast packets, and this is the more
generalized version to do that.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Simon Wunderlich [Thu, 25 Apr 2013 08:37:22 +0000 (10:37 +0200)]
batman-adv: add build check macros for packet member offset
Since we removed the __packed from most of the packets, we should
make sure that the offset generated by the compiler are correct for
sent/received data.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Simon Wunderlich [Thu, 25 Apr 2013 09:57:42 +0000 (11:57 +0200)]
batman-adv: remove vis functionality
This is replaced by a userspace program, we don't need this
functionality to bloat the kernel.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Antonio Quartulli [Wed, 24 Apr 2013 14:37:52 +0000 (16:37 +0200)]
batman-adv: move BATADV_TT_CLIENT_TEMP to higher bit
Client flags from bit 0 to 7 are sent over the wire.
BATADV_TT_CLIENT_TEMP is a local flag and is not supposed
to be sent to the network. Therefore it has occupy a
higher bit.
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Antonio Quartulli [Wed, 24 Apr 2013 14:37:51 +0000 (16:37 +0200)]
batman-adv: use CRC32C instead of CRC16 in TT code
CRC32C has to be preferred to CRC16 because of its possible
HW native support and because of the reduced collision
probability. With this change the Translation Table
component now uses CRC32C to compute the local and global
table checksum.
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Marek Lindner [Tue, 23 Apr 2013 13:40:03 +0000 (21:40 +0800)]
batman-adv: tvlv - convert roaming adv packet to use tvlv unicast packets
Instead of generating roaming specific packets the TVLV unicast API is
used to send roaming information.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Marek Lindner [Tue, 23 Apr 2013 13:40:02 +0000 (21:40 +0800)]
batman-adv: tvlv - convert tt query packet to use tvlv unicast packets
Instead of generating TT specific packets the TVLV unicast API is used
to send translation table data.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Marek Lindner [Tue, 23 Apr 2013 13:40:01 +0000 (21:40 +0800)]
batman-adv: tvlv - convert tt data sent within OGMs
The translation table meta data (version number, crc checksum, etc)
as well as the translation table diff propgated within OGMs now uses
the newly introduced tvlv infrastructure.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Marek Lindner [Tue, 23 Apr 2013 13:40:00 +0000 (21:40 +0800)]
batman-adv: tvlv - add network coding container
Create network coding container to announce network coding
capabilities (if enabled).
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Marek Lindner [Tue, 23 Apr 2013 13:39:59 +0000 (21:39 +0800)]
batman-adv: tvlv - add distributed arp table container
Create DAT container to announce DAT capabilities (if enabled).
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Marek Lindner [Tue, 23 Apr 2013 13:39:58 +0000 (21:39 +0800)]
batman-adv: tvlv - gateway download/upload bandwidth container
Prior to this patch batman-adv read the advertised uplink bandwidth
from userspace and compressed this information into a single byte
called "gateway class".
Now the download & upload bandwidth information is sent as-is. No
userspace change is necessary since the sysfs API always allowed
to specify a bandwidth.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Spyros Gasteratos <morfeas3000@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Marek Lindner [Tue, 23 Apr 2013 13:39:57 +0000 (21:39 +0800)]
batman-adv: tvlv - basic infrastructure
The goal is to provide the infrastructure for sending, receiving and
parsing information 'containers' while preserving backward
compatibility. TVLV (based on the commonly known Type Length Value
technique) was chosen as the format for those containers. Even if a
node does not know the tvlv type of a certain container it can simply
skip the current container and proceed with the next. Past experience
has shown features evolve over time, so a 'version' field was added
right from the start to allow differentiating between feature
variants - hence the name: T(ype) V(ersion) L(ength) V(alue).
This patch introduces the basic TVLV infrastructure:
* register / unregister tvlv containers to be sent with each OGM
(on primary interfaces only)
* register / unregister callback handlers to be called upon
finding the corresponding tvlv type in a tvlv buffer
* unicast tvlv send / receive API calls
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Spyros Gasteratos <morfeas3000@gmail.com>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Antonio Quartulli [Sat, 20 Apr 2013 13:59:13 +0000 (15:59 +0200)]
batman-adv: switch to a new packet compatibility version
With this change batman-adv is breaking compatibility with
older versions and it is moving to compat-version 15.
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Antonio Quartulli [Sat, 21 Sep 2013 11:46:56 +0000 (13:46 +0200)]
MAINTAINERS: batman-adv - update emails
Update my and Marek Lindner's email in the MAINTAINERS file
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Yuval Mintz [Wed, 9 Oct 2013 14:06:28 +0000 (16:06 +0200)]
bnx2x: Add ndo_get_phys_port_id support
Each network interface (either PF or VF) is identified by its port's MAC id.
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 9 Oct 2013 10:05:48 +0000 (03:05 -0700)]
net: fix build errors if ipv6 is disabled
CONFIG_IPV6=n is still a valid choice ;)
It appears we can remove dead code.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 9 Oct 2013 04:47:29 +0000 (21:47 -0700)]
udp: fix a typo in __udp4_lib_mcast_demux_lookup
At this point sk might contain garbage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Oct 2013 22:42:29 +0000 (15:42 -0700)]
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Oct 2013 07:22:02 +0000 (00:22 -0700)]
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11,
8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10,
4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Oct 2013 03:07:53 +0000 (23:07 -0400)]
Merge git://git./linux/kernel/git/davem/net
Conflicts:
include/linux/netdevice.h
net/core/sock.c
Trivial merge issues.
Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.
Two parallel line additions in net/core/sock.c
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Oct 2013 01:56:09 +0000 (21:56 -0400)]
Merge branch 'sfc-3.12' of git://git./linux/kernel/git/bwh/sfc
Ben Hutchings says:
====================
Some more fixes for EF10 support; hopefully the last lot:
1. Fixes for reading statistics, from Edward Cree and Jon Cooper.
2. Addition of ethtool statistics for packets dropped by the hardware
before they were associated with a specific function, from Edward Cree.
3. Only bind to functions that are in control of their associated port,
as the driver currently assumes this is the case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 8 Oct 2013 22:16:00 +0000 (15:16 -0700)]
pkt_sched: fq: fix non TCP flows pacing
Steinar reported FQ pacing was not working for UDP flows.
It looks like the initial sk->sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)
Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.
Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Oct 2013 01:52:03 +0000 (21:52 -0400)]
Revert "veth: Showing peer of veth type dev in ip link (kernel side)"
This reverts commit
612c337306f00dc8d396830212de51c475844791.
As per Stephen Hemminger, the layout of the netlink attribute
is not implemented correctly so revert this for now.
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 8 Oct 2013 03:32:17 +0000 (11:32 +0800)]
qlcnic: add missing destroy_workqueue() on error path in qlcnic_probe()
Add the missing destroy_workqueue() before return from
qlcnic_probe() in the error handling case.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 8 Oct 2013 03:19:19 +0000 (11:19 +0800)]
moxa: fix the error handling in moxart_mac_probe()
This patch fix the error handling in moxart_mac_probe():
- return -ENOMEM in some memory alloc fail cases
- add missing free_netdev() in the error handling case
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde [Mon, 7 Oct 2013 21:19:58 +0000 (23:19 +0200)]
net: vlan: fix nlmsg size calculation in vlan_get_size()
This patch fixes the calculation of the nlmsg size, by adding the missing
nla_total_size().
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 7 Oct 2013 19:50:18 +0000 (12:50 -0700)]
pkt_sched: fq: fix typo for initial_quantum
TCA_FQ_INITIAL_QUANTUM should set q->initial_quantum
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oussama Ghorbel [Mon, 7 Oct 2013 17:50:05 +0000 (18:50 +0100)]
ipv6: Fix the upper MTU limit in GRE tunnel
Unlike ipv4, the struct member hlen holds the length of the GRE and ipv6
headers. This length is also counted in dev->hard_header_len.
Perhaps, it's more clean to modify the hlen to count only the GRE header
without ipv6 header as the variable name suggest, but the simple way to fix
this without regression risk is simply modify the calculation of the limit
in ip6gre_tunnel_change_mtu function.
Verified in kernel version v3.11.
Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gao feng [Tue, 8 Oct 2013 03:05:20 +0000 (11:05 +0800)]
cgroup: cls: remove unnecessary task_cls_classid
We can get classid through cgroup_subsys_state,
this is directviewing and effective.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gao feng [Tue, 8 Oct 2013 03:05:19 +0000 (11:05 +0800)]
cgroup: netprio: remove unnecessary task_netprioidx
Since the tasks have been migrated to the cgroup,
there is no need to call task_netprioidx to get
task's cgroup id.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shawn Bohrer [Mon, 7 Oct 2013 16:01:40 +0000 (11:01 -0500)]
net: ipv4 only populate IP_PKTINFO when needed
The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received. This has introduced a performance regression for some
UDP workloads.
This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.
Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After 90587.62 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After 12.48us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shawn Bohrer [Mon, 7 Oct 2013 16:01:39 +0000 (11:01 -0500)]
udp: ipv4: Add udp early demux
The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.
For UDP multicast we can only cache the dst if there is only one
receiving socket on the host. Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.
For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups. Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.
Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After 89789.68 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After 12.63us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shawn Bohrer [Mon, 7 Oct 2013 16:01:38 +0000 (11:01 -0500)]
udp: Only allow busy read/poll on connected sockets
UDP sockets can receive packets from multiple endpoints and thus may be
received on multiple receive queues. Since packets packets can arrive
on multiple receive queues we should not mark the napi_id for all
packets. This makes busy read/poll only work for connected UDP sockets.
This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 7 Oct 2013 15:32:32 +0000 (08:32 -0700)]
net_sched: increment drop counters in qdisc_tree_decrease_qlen()
qdisc_tree_decrease_qlen() is called when some packets are dropped
on a qdisc, and we want to notify parents of qlen changes.
We also can increment parents qdisc qstats drop counters.
This permits more accurate drop counters up to root qdisc.
For example a graft operation typically resets a qdisc
(drops all packets) and call qdisc_tree_decrease_qlen()
Note that callers are responsible for their drop counters.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Vrabel [Mon, 7 Oct 2013 12:55:19 +0000 (13:55 +0100)]
xen-netback: transition to CLOSED when removing a VIF
If a guest is destroyed without transitioning its frontend to CLOSED,
the domain becomes a zombie as netback was not grant unmapping the
shared rings.
When removing a VIF, transition the backend to CLOSED so the VIF is
disconnected if necessary (which will unmap the shared rings etc).
This fixes a regression introduced by
279f438e36c0a70b23b86d2090aeec50155034a9 (xen-netback: Don't destroy
the netdev until the vif is shut down).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Paul Durrant <Paul.Durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 20:10:10 +0000 (16:10 -0400)]
Merge branch 'mlx4'
Amir Vadai says:
====================
net/mlx4_en: Fix pages never dma unmapped on rx
This patchset fixes a bug introduced by commit
51151a16 (mlx4: allow order-0
memory allocations in RX path). Where dma_unmap_page wasn't called.
Changes from V0:
- Added "Rename name of mlx4_en_rx_alloc members". Old names were confusing.
- Last frag in page calculation was wrong. Since all frags in page are of the
same size, need to add this frag_stride to end of frag offset, and not the
size of next frag in skb.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Mon, 7 Oct 2013 11:38:13 +0000 (13:38 +0200)]
net/mlx4_en: Fix pages never dma unmapped on rx
This patch fixes a bug introduced by commit
51151a16 (mlx4: allow
order-0 memory allocations in RX path).
dma_unmap_page never reached because condition to detect last fragment
in page is wrong. offset+frag_stride can't be greater than size, need to
make sure no additional frag will fit in page => compare offset +
frag_stride + next_frag_size instead.
next_frag_size is the same as the current one, since page is shared only
with frags of the same size.
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Mon, 7 Oct 2013 11:38:12 +0000 (13:38 +0200)]
net/mlx4_en: Rename name of mlx4_en_rx_alloc members
Add page prefix to page related members: @size and @offset into
@page_size and @page_offset
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Veaceslav Falico [Mon, 7 Oct 2013 07:17:20 +0000 (09:17 +0200)]
bonding: ensure that TLB mode's active slave has correct mac filter
Currently, in TLB mode we change mac addresses only by memcpy-ing the to
net_device->dev_addr, without actually setting them via
dev_set_mac_address(). This permits us to receive all the traffic always on
one mac address.
However, in case the interface flips, some drivers might enforce the
mac filtering for its FW/HW based on current ->dev_addr, and thus we won't
be able to receive traffic on that interface, in case it will be selected
as active in TLB mode.
Fix it by setting the mac address forcefully on every new active slave that
we select in TLB mode.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Yuval Mintz <yuvalmin@broadcom.com>
Reported-by: Yuval Mintz <yuvalmin@broadcom.com>
Tested-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nguyen Hong Ky [Mon, 7 Oct 2013 06:29:25 +0000 (15:29 +0900)]
net: sh_eth: Fix RX packets errors on R8A7740
This patch will fix RX packets errors when receiving big size
of data by set bit RNC = 1.
RNC - Receive Enable Control
0: Upon completion of reception of one frame, the E-DMAC writes
the receive status to the descriptor and clears the RR bit in
EDRRR to 0.
1: Upon completion of reception of one frame, the E-DMAC writes
(writes back) the receive status to the descriptor. In addition,
the E-DMAC reads the next descriptor and prepares for reception
of the next frame.
In addition, for get more stable when receiving packets, I set
maximum size for the transmit/receive FIFO and inserts padding
in receive data.
Signed-off-by: Nguyen Hong Ky <nh-ky@jinso.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 19:44:26 +0000 (15:44 -0400)]
l2tp: Fix build warning with ipv6 disabled.
net/l2tp/l2tp_core.c: In function ‘l2tp_verify_udp_checksum’:
net/l2tp/l2tp_core.c:499:22: warning: unused variable ‘tunnel’ [-Wunused-variable]
Create a helper "l2tp_tunnel()" to facilitate this, and as a side
effect get rid of a bunch of unnecessary void pointer casts.
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Sun, 6 Oct 2013 18:25:12 +0000 (21:25 +0300)]
tun: don't look at current when non-blocking
We play with a wait queue even if socket is
non blocking. This is an obvious waste.
Besides, it will prevent calling the non blocking
variant when current is not valid.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 19:32:19 +0000 (15:32 -0400)]
Merge branch 'mrf24j40'
Alan Ott says:
====================
Fix race conditions in mrf24j40 interrupts
After testing with the betas of this patchset, it's been rebased and is
ready for inclusion.
David Hauweele noticed that the mrf24j40 would hang arbitrarily after some
period of heavy traffic. Two race conditions were discovered, and the
driver was changed to use threaded interrupts, since the enable/disable of
interrupts in the driver has recently been a lighning rod whenever issues
arise related to interrupts (costing engineering time), and since threaded
interrupts are the right way to do it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Ott [Sun, 6 Oct 2013 03:52:24 +0000 (23:52 -0400)]
mrf24j40: Use level-triggered interrupts
The mrf24j40 generates level interrupts. There are rare cases where it
appears that the interrupt line never gets de-asserted between interrupts,
causing interrupts to be lost, and causing a hung device from the driver's
perspective. Switching the driver to interpret these interrupts as
level-triggered fixes this issue.
Signed-off-by: Alan Ott <alan@signal11.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Ott [Sun, 6 Oct 2013 03:52:23 +0000 (23:52 -0400)]
mrf24j40: Use threaded IRQ handler
Eliminate all the workqueue and interrupt enable/disable.
Signed-off-by: Alan Ott <alan@signal11.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Ott [Sun, 6 Oct 2013 03:52:22 +0000 (23:52 -0400)]
mrf24j40: Move INIT_COMPLETION() to before packet transmission
This avoids a race condition where complete(tx_complete) could be called
before tx_complete is initialized.
Signed-off-by: Alan Ott <alan@signal11.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 19:28:53 +0000 (15:28 -0400)]
Merge branch '6lowpan'
Alan Ott says:
====================
Alexander Aring suggested that devices desired to be linked to 6lowpan
be checked for actually being of type IEEE802154, since IEEE802154 devices
are all that are supported by 6lowpan at present.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Ott [Sun, 6 Oct 2013 03:15:19 +0000 (23:15 -0400)]
6lowpan: Sync default hardware address of lowpan links to their wpan
When a lowpan link to a wpan device is created, set the hardware address
of the lowpan link to that of the wpan device.
Signed-off-by: Alan Ott <alan@signal11.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Ott [Sun, 6 Oct 2013 03:15:18 +0000 (23:15 -0400)]
6lowpan: Only make 6lowpan links to IEEE802154 devices
Refuse to create 6lowpan links if the actual hardware interface is
of any type other than ARPHRD_IEEE802154.
Signed-off-by: Alan Ott <alan@signal11.us>
Suggested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Masatake YAMATO [Fri, 4 Oct 2013 02:34:21 +0000 (11:34 +0900)]
veth: Showing peer of veth type dev in ip link (kernel side)
ip link has ability to show extra information of net work device if
kernel provides sunh information. With this patch veth driver can
provide its peer ifindex information to ip command via netlink
interface.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Liu [Tue, 8 Oct 2013 09:54:21 +0000 (10:54 +0100)]
Revert "xen-netback: improve ring effeciency for guest RX"
This reverts commit
4f0581d25827d5e864bcf07b05d73d0d12a20a5c.
The named changeset is causing problem. Let's aim to make this part less
fragile before trying to improve things.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Xi Xiong <xixiong@amazon.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Sat, 5 Oct 2013 20:15:30 +0000 (13:15 -0700)]
net: Update the sysctl permissions handler to test effective uid/gid
On Tue, 20 Aug 2013 11:40:04 -0500 Eric Sandeen <sandeen@redhat.com> wrote:
> This was brought up in a Red Hat bug (which may be marked private, I'm sorry):
>
> Bug 987055 - open O_WRONLY succeeds on some root owned files in /proc for process running with unprivileged EUID
>
> "On RHEL7 some of the files in /proc can be opened for writing by an unprivileged EUID."
>
> The flaw existed upstream as well last I checked.
>
> This commit in kernel v3.8 caused the regression:
>
> commit
cff109768b2d9c03095848f4cd4b0754117262aa
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date: Fri Nov 16 03:03:01 2012 +0000
>
> net: Update the per network namespace sysctls to be available to the network namespace owner
>
> - Allow anyone with CAP_NET_ADMIN rights in the user namespace of the
> the netowrk namespace to change sysctls.
> - Allow anyone the uid of the user namespace root the same
> permissions over the network namespace sysctls as the global root.
> - Allow anyone with gid of the user namespace root group the same
> permissions over the network namespace sysctl as the global root group.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> because it changed /sys/net's special permission handler to test current_uid, not
> current_euid; same for current_gid/current_egid.
>
> So in this case, root cannot drop privs via set[ug]id, and retains all privs
> in this codepath.
Modify the code to use current_euid(), and in_egroup_p, as in done
in fs/proc/proc_sysctl.c:test_perm()
Cc: stable@vger.kernel.org
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde [Sat, 5 Oct 2013 19:25:17 +0000 (21:25 +0200)]
can: dev: fix nlmsg size calculation in can_get_size()
This patch fixes the calculation of the nlmsg size, by adding the missing
nla_total_size().
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Opdenacker [Sat, 5 Oct 2013 04:45:30 +0000 (06:45 +0200)]
net: wan: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Opdenacker [Sat, 5 Oct 2013 04:39:24 +0000 (06:39 +0200)]
irda: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Opdenacker [Sat, 5 Oct 2013 04:25:46 +0000 (06:25 +0200)]
net: hamradio/yam: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Opdenacker [Sat, 5 Oct 2013 04:22:30 +0000 (06:22 +0200)]
net: hamradio/scc: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matthew Whitehead [Fri, 4 Oct 2013 21:03:08 +0000 (17:03 -0400)]
net: fujitsu: Remove ISA depdendency from Kconfig
There no longer are ISA drivers in the fujitsu directory, so remove the
dependency from the Kconfig.
Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 7 Oct 2013 19:40:44 +0000 (15:40 -0400)]
Merge git://git./linux/kernel/git/linville/wireless-next
Conflicts:
drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
drivers/net/wireless/rtlwifi/rtl8188ee/phy.h
drivers/net/wireless/rtlwifi/rtl8192ce/phy.h
drivers/net/wireless/rtlwifi/rtl8192de/phy.h
drivers/net/wireless/rtlwifi/rtl8723ae/phy.h
Just some minor conflicts between the wireless-next changes
and Joe Perches's "extern" removal from function prototypes
in header files.
John W. Linville says:
====================
Regarding the Bluetooth bits, Gustavo says:
"The big work here is from Marcel and Johan. They did a lot of work
in the L2CAP, HCI and MGMT layers. The most important ones are the
addition of a new MGMT command to enable/disable LE advertisement
and the introduction of the HCI user channel to allow applications
to get directly and exclusive access to Bluetooth devices."
As to the ath10k bits, Kalle says:
"Bartosz dropped support for qca98xx hw1.0 hardware from ath10k, it's
just too much to support it. Michal added support for the new firmware
interface. Marek fixed WEP in AP and IBSS mode. Rest of the changes are
minor fixes or cleanups."
And also:
"Major changes are:
* throughput improvements including aligning the RX frames correctly and
optimising HTT layer (Michal)
* remove qca98xx hw1.0 support (Bartosz)
* add support for firmware version 999.999.0.636 (Michal)
* firmware htt statistics support (Kalle)
* fix WEP in AP and IBSS mode (Marek)
* fix a mutex unlock balance in debugfs file (Shafi)
And of course there's a lot of smaller fixes and cleanup."
For the wl12xx bits, Luca says:
"Here are some patches intended for 3.13. Eliad is upstreaming a bunch
of patches that have been pending in the internal tree. Mostly bugfixes
and other small improvements."
Along with that...
Arend and friends bring us a batch of brcmfmac updates, Larry Finger
offers some rtlwifi refactoring, and Sujith sends the usual batch of
ath9k updates. As usual, there are a number of other small updates
from a variety of players as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Fri, 4 Oct 2013 15:04:48 +0000 (17:04 +0200)]
ipv4: fix ineffective source address selection
When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit
813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman [Sun, 6 Oct 2013 02:26:05 +0000 (19:26 -0700)]
net: Separate the close_list and the unreg_list v2
Separate the unreg_list and the close_list in dev_close_many preventing
dev_close_many from permuting the unreg_list. The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
it has been freed in code such as dst_ifdown. Resulting in subtle memory
corruption.
This is the second bug from sharing the storage between the close_list
and the unreg_list. The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.
v2: Make all callers pass in a close_list to dev_close_many
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Fri, 4 Oct 2013 12:44:40 +0000 (14:44 +0200)]
net/ethernet: cpsw: DT read bool dual_emac
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Fri, 4 Oct 2013 12:44:39 +0000 (14:44 +0200)]
net: ethernet: cpsw: Search childs for slave nodes
The current implementation searches the whole DT for nodes named
"slave".
This patch changes it to search only child nodes for slaves.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Fri, 4 Oct 2013 07:14:06 +0000 (00:14 -0700)]
net: fix unsafe set_memory_rw from softirq
on x86 system with net.core.bpf_jit_enable = 1
sudo tcpdump -i eth1 'tcp port 22'
causes the warning:
[ 56.766097] Possible unsafe locking scenario:
[ 56.766097]
[ 56.780146] CPU0
[ 56.786807] ----
[ 56.793188] lock(&(&vb->lock)->rlock);
[ 56.799593] <Interrupt>
[ 56.805889] lock(&(&vb->lock)->rlock);
[ 56.812266]
[ 56.812266] *** DEADLOCK ***
[ 56.812266]
[ 56.830670] 1 lock held by ksoftirqd/1/13:
[ 56.836838] #0: (rcu_read_lock){.+.+..}, at: [<
ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
[ 56.849757]
[ 56.849757] stack backtrace:
[ 56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[ 56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 56.882004]
ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
[ 56.895630]
ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
[ 56.909313]
ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
[ 56.923006] Call Trace:
[ 56.929532] [<
ffffffff8175a145>] dump_stack+0x55/0x76
[ 56.936067] [<
ffffffff81755b14>] print_usage_bug+0x1f7/0x208
[ 56.942445] [<
ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
[ 56.948932] [<
ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
[ 56.955470] [<
ffffffff810ccb52>] mark_lock+0x282/0x2c0
[ 56.961945] [<
ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
[ 56.968474] [<
ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
[ 56.975140] [<
ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
[ 56.981942] [<
ffffffff810cef72>] lock_acquire+0x92/0x1d0
[ 56.988745] [<
ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[ 56.995619] [<
ffffffff817628f1>] _raw_spin_lock+0x41/0x50
[ 57.002493] [<
ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[ 57.009447] [<
ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
[ 57.016477] [<
ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
[ 57.023607] [<
ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
[ 57.030818] [<
ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
[ 57.037896] [<
ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
[ 57.044789] [<
ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
[ 57.051720] [<
ffffffff81043d9f>] set_memory_rw+0x2f/0x40
[ 57.058727] [<
ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
[ 57.065577] [<
ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
[ 57.072338] [<
ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
[ 57.078962] [<
ffffffff81057f17>] __do_softirq+0xf7/0x3f0
[ 57.085373] [<
ffffffff81058245>] run_ksoftirqd+0x35/0x70
cannot reuse jited filter memory, since it's readonly,
so use original bpf insns memory to hold work_struct
defer kfree of sk_filter until jit completed freeing
tested on x86_64 and i386
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Mon, 7 Oct 2013 19:10:11 +0000 (20:10 +0100)]
sfc: Only bind to EF10 functions with the LinkCtrl and Trusted flags
Although we do not yet enable multiple PFs per port, it is possible
that a board will be reconfigured to enable them while the driver has
not yet been updated to fully support this.
The most obvious problem is that multiple functions may try to set
conflicting link settings. But we will also run into trouble if the
firmware doesn't consider us fully trusted. So, abort probing unless
both the LinkCtrl and Trusted flags are set for this function.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Oussama Ghorbel [Thu, 3 Oct 2013 13:49:26 +0000 (14:49 +0100)]
ipv6: Allow the MTU of ipip6 tunnel to be set below 1280
The (inner) MTU of a ipip6 (IPv4-in-IPv6) tunnel cannot be set below 1280, which is the minimum MTU in IPv6.
However, there should be no IPv6 on the tunnel interface at all, so the IPv6 rules should not apply.
More info at https://bugzilla.kernel.org/show_bug.cgi?id=15530
This patch allows to check the minimum MTU for ipv6 tunnel according to these rules:
-In case the tunnel is configured with ipip6 mode the minimum MTU is 68.
-In case the tunnel is configured with ip6ip6 or any mode the minimum MTU is 1280.
Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Wed, 2 Oct 2013 06:14:06 +0000 (09:14 +0300)]
netif_set_xps_queue: make cpu mask const
virtio wants to pass in cpumask_of(cpu), make parameter
const to avoid build warnings.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 25 Sep 2013 16:32:09 +0000 (17:32 +0100)]
sfc: Add PM and RXDP drop counters to ethtool stats
Recognise the new Packet Memory and RX Data Path counters.
The following counters are added:
rx_pm_{trunc,discard}_bb_overflow - burst buffer overflowed. This should not
occur if BB correctly configured.
rx_pm_{trunc,discard}_vfifo_full - not enough space in packet memory. May
indicate RX performance problems.
rx_pm_{trunc,discard}_qbb - dropped by 802.1Qbb early discard mechanism.
Since Qbb is not supported at present, this should not occur.
rx_pm_discard_mapping - 802.1p priority configured to be dropped. This should
not occur in normal operation.
rx_dp_q_disabled_packets - packet was to be delivered to a queue but queue is
disabled. May indicate misconfiguration by the driver.
rx_dp_di_dropped_packets - parser-dispatcher indicated that a packet should be
dropped.
rx_dp_streaming_packets - packet was sent to the RXDP streaming bus, ie. a
filter directed the packet to the MCPU.
rx_dp_emerg_{fetch,wait} - RX datapath had to wait for descriptors to be
loaded. Indicates performance problems but not drops.
These are only provided if the MC firmware has the
PM_AND_RXDP_COUNTERS capability. Otherwise, mask them out.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Matthew Slattery [Tue, 10 Sep 2013 18:06:27 +0000 (19:06 +0100)]
sfc: Add definitions for new stats counters and capability flag
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Edward Cree [Fri, 27 Sep 2013 17:52:49 +0000 (18:52 +0100)]
sfc: Refactor EF10 stat mask code to allow for more conditional stats
Previously, efx_ef10_stat_mask returned a static const unsigned long[], which
meant that each possible mask had to be declared statically with
STAT_MASK_BITMAP. Since adding a condition would double the size of the
decision tree, we now create the bitmask dynamically.
To do this, we have two functions efx_ef10_raw_stat_mask, which returns a u64,
and efx_ef10_get_stat_mask, which fills in an unsigned long * argument.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>