Paolo Abeni [Fri, 28 Apr 2017 09:20:01 +0000 (11:20 +0200)]
infiniband: call ipv6 route lookup via the stub interface
The infiniband address handle can be triggered to resolve an ipv6
address in response to MAD packets, regardless of the ipv6
module being disabled via the kernel command line argument.
That will cause a call into the ipv6 routing code, which is not
initialized, and a conseguent oops.
This commit addresses the above issue replacing the direct lookup
call with an indirect one via the ipv6 stub, which is properly
initialized according to the ipv6 status (e.g. if ipv6 is
disabled, the routing lookup fails gracefully)
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Amrani, Ram [Thu, 27 Apr 2017 10:35:35 +0000 (13:35 +0300)]
RDMA/qedr: add support for send+invalidate in poll CQ
Split the poll responder CQ into two functions.
Add support for send+invalidate in poll CQ.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Amrani, Ram [Thu, 27 Apr 2017 10:35:34 +0000 (13:35 +0300)]
RDMA/qedr: destroy CQ only after HW releases it
Wait for all relevant CNQ interrupts before freeing the CQ.
Don't invoke completion handlers for a destroyed CQ.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Amrani, Ram [Thu, 27 Apr 2017 10:35:33 +0000 (13:35 +0300)]
RDMA/qedr: enhance destroy flow for GSI QP
Avoid attempting to release irrelevant (and unused) resources for GSI QP.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Amrani, Ram [Thu, 27 Apr 2017 10:35:32 +0000 (13:35 +0300)]
RDMA/qedr: properly check atomic capabilities
After checking the path upwards towards root complex, actualy check
root complex atomic_req capability, and not our own NIC.
Verify that the PCIe device control register's atomic egress block
is cleared in the path.
Verify that the PCIe version is at least 2.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Amrani, Ram [Thu, 27 Apr 2017 10:35:31 +0000 (13:35 +0300)]
RDMA/qedr: reset access control when registering a MR
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Thu, 6 Apr 2017 22:33:14 +0000 (16:33 -0600)]
IB/vmw_pvrdma: Spare annotate imm_data
imm_data is copied directly from the ib_send_wr and ib_wc which have
it marked as __be32, copy that mark into the uapi structures as well.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Adit Ranadive <aditr@vmware.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:59 +0000 (09:23 +0300)]
IB/mlx5: Add ODP support to MW
Internally MW implemented as KLM MKey and filled by userspace UMR
postsends. Handle pagefault trigered by operations on this MKeys.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:58 +0000 (09:23 +0300)]
IB/mlx5: Extract page fault code
To make page fault handling code more flexible
split pagefault_single_data_segment() function.
Keep MR resolution in pagefault_single_data_segment() and
move actual updates into pagefault_single_mr().
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:57 +0000 (09:23 +0300)]
IB/umem: Add support to huge ODP
Add IB_ACCESS_HUGETLB ib_reg_mr flag.
Hugetlb region registered with this flag
will use single translation entry per huge page.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:56 +0000 (09:23 +0300)]
IB/mlx5: Add contiguous ODP support
Currenlty ODP supports only regular MMU pages.
Add ODP support for regions consisting of physically contiguous chunks
of arbitrary order (huge pages for instance) to improve performance.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:55 +0000 (09:23 +0300)]
IB/umem: Add contiguous ODP support
Currenlty ODP supports only regular MMU pages.
Add ODP support for regions consisting of physically contiguous chunks
of arbitrary order (huge pages for instance) to improve performance.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:54 +0000 (09:23 +0300)]
IB/mlx5: Decrease verbosity level of ODP errors
Decrease verbosity level of ODP error flows messages to debug level.
Remove one redundant print since debug level message already exists in
this flow.
Fixes:
d9aaed838765 ('{net,IB}/mlx5: Refactor page fault handling')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:53 +0000 (09:23 +0300)]
IB/mlx5: Fix implicit MR GC
When implicit MR's leaf MKey becomes unused, i.e. when it's
last page being released my MMU invalidation it is marked as "dying"
and scheduled for release by garbage collector.
Currentle consequent page fault may remove "dying" flag.
Treat leaf MKey as non-existent once it was scheduled to removal
by GC.
Fixes:
81713d3788d2 ('IB/mlx5: Add implicit MR support')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:52 +0000 (09:23 +0300)]
IB/mlx5: Fix UMR size calculation
Translation table updates of large UMR may require multiple post send
operations. The last operations can be in various lengths, but current
code set them to be the same length.
Fixes:
7d0cc6edcc70 ('IB/mlx5: Add MR cache for large UMR regions')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:51 +0000 (09:23 +0300)]
IB/mlx5: Fix function updating xlt emergency path
In memory shortage path we fall back to use spare buffer.
mlx5_ib_update_xlt() called from ib_uverbs_reg_mr when ibmr.ucontext
not initialized yet.
Scenario how to test it:
1. trigger memory exhaustion so __get_free_pages(GFP_KERNEL, 4) will fail
2. register MR
3. there should be no kernel oops
Fixes:
7d0cc6edcc70 ('IB/mlx5: Add MR cache for large UMR regions')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Artemy Kovalyov [Wed, 5 Apr 2017 06:23:50 +0000 (09:23 +0300)]
IB: Replace ib_umem page_size by page_shift
Size of pages are held by struct ib_umem in page_size field.
It is better to store it as an exponent, because page size by nature
is always power-of-two and used as a factor, divisor or ilog2's argument.
The conversion of page_size to be page_shift allows to have portable
code and avoid following error while compiling on ARM:
ERROR: "__aeabi_uldivmod" [drivers/infiniband/core/ib_core.ko] undefined!
CC: Selvin Xavier <selvin.xavier@broadcom.com>
CC: Steve Wise <swise@chelsio.com>
CC: Lijun Ou <oulijun@huawei.com>
CC: Shiraz Saleem <shiraz.saleem@intel.com>
CC: Adit Ranadive <aditr@vmware.com>
CC: Dennis Dalessandro <dennis.dalessandro@intel.com>
CC: Ram Amrani <Ram.Amrani@Cavium.com>
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Ram Amrani <Ram.Amrani@cavium.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Zhu Yanjun [Sat, 1 Apr 2017 03:42:55 +0000 (23:42 -0400)]
IB/core: change the return type to void
The function ib_unregister_mad_agent always returns zero. And
this returned value is not checked. As such, chane the return
type to void.
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Selvin Xavier [Tue, 4 Apr 2017 08:30:38 +0000 (01:30 -0700)]
MAINTAINERS: Update ocrdma module status
Since ocrdma driver is not going to be updated with any
new development activity, except for critical bug fixes
reported by partners or customers, changing the module status
to "Odd Fixes". Also, updating the web page info and the
maintainers email addresses.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Ira Weiny [Fri, 31 Mar 2017 17:04:51 +0000 (13:04 -0400)]
IB/hfi: Fix up comments in engine mapping
Fix off by 1 error in comments documenting the sdma and send context
mappings.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Geert Uytterhoeven [Sun, 12 Mar 2017 13:16:53 +0000 (14:16 +0100)]
MAINTAINERS: Add file patterns for infiniband device tree bindings
Submitters of device tree binding documentation may forget to CC
the subsystem maintainer if this is missing.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vlad Tsyrklevich [Fri, 24 Mar 2017 19:55:17 +0000 (15:55 -0400)]
infiniband/uverbs: Fix integer overflows
The 'num_sge' variable is verfied to be smaller than the 'sge_count'
variable; however, since both are user-controlled it's possible to cause
an integer overflow for the kmalloc multiply on 32-bit platforms
(num_sge and sge_count are both defined u32). By crafting an input that
causes a smaller-than-expected allocation it's possible to write
controlled data out-of-bounds.
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Arnd Bergmann [Fri, 24 Mar 2017 22:02:48 +0000 (23:02 +0100)]
infiniband: hns: avoid gcc-7.0.1 warning for uninitialized data
hns_roce_v1_cq_set_ci() calls roce_set_bit() on an uninitialized field,
which will then change only a few of its bits, causing a warning with
the latest gcc:
infiniband/hw/hns/hns_roce_hw_v1.c: In function 'hns_roce_v1_cq_set_ci':
infiniband/hw/hns/hns_roce_hw_v1.c:1854:23: error: 'doorbell[1]' is used uninitialized in this function [-Werror=uninitialized]
roce_set_bit(doorbell[1], ROCEE_DB_OTHERS_H_ROCEE_DB_OTH_HW_SYNS_S, 1);
The code is actually correct since we always set all bits of the
port_vlan field, but gcc correctly points out that the first
access does contain uninitialized data.
This initializes the field to zero first before setting the
individual bits.
Fixes:
9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Petr Mladek [Mon, 17 Oct 2016 15:39:32 +0000 (17:39 +0200)]
IB/fmr_pool: Convert the cleanup thread into kthread worker API
Kthreads are currently implemented as an infinite loop. Each
has its own variant of checks for terminating, freezing,
awakening. In many cases it is unclear to say in which state
it is and sometimes it is done a wrong way.
The plan is to convert kthreads into kthread_worker or workqueues
API. It allows to split the functionality into separate operations.
It helps to make a better structure. Also it defines a clean state
where no locks are taken, IRQs blocked, the kthread might sleep
or even be safely migrated.
The kthread worker API is useful when we want to have a dedicated
single thread for the work. It helps to make sure that it is
available when needed. Also it allows a better control, e.g.
define a scheduling priority.
This patch converts the frm_pool kthread into the kthread worker
API because I am not sure how busy the thread is. It is well
possible that it does not need a dedicated kthread and workqueues
would be perfectly fine. Well, the conversion between kthread
worker API and workqueues is pretty trivial.
The patch moves one iteration from the kthread into the work function.
It is queued only when there is a pending work. Therefore we do not
need to compare flush_ser and req_ser at the beginning. On the contrary,
the same work could be queued only once at a time. Therefore it has to
re-queue itself if some requests are pending.
Otherwise, wake_up_process() is replaced by queuing the work.
Important: The change is only compile tested. I did not find an easy
way how to check it in a real life.
Signed-off-by: Petr Mladek <pmladek@suse.com>
TO: Doug Ledford <dledford@redhat.com>
CC: Sean Hefty <sean.hefty@intel.com>
CC: Hal Rosenstock <hal.rosenstock@gmail.com>
CC: linux-rdma@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Yuval Shaia [Tue, 14 Mar 2017 14:01:57 +0000 (16:01 +0200)]
{net,IB}/{rxe,usnic}: Utilize generic mac to eui32 function
This logic seems to be duplicated in (at least) three separate files.
Move it to one place so code can be re-use.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Yuval Shaia [Wed, 1 Mar 2017 20:09:09 +0000 (22:09 +0200)]
IB/usnic: Remove unused functions
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Colin Ian King [Thu, 23 Feb 2017 11:22:53 +0000 (11:22 +0000)]
IB/iser: fix spelling mistake: "unexepected" -> "unexpected"
trivial fix to spelling mistake in iser_err error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Ganesh Goudar [Thu, 23 Feb 2017 07:01:43 +0000 (12:31 +0530)]
iw_cxgb4: Use dsgl by default
Enable the use of dsgl by default and determine whether dsgl is
supported from lld info.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Bharat Potnuri <bharat@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Tue, 25 Apr 2017 18:00:59 +0000 (14:00 -0400)]
RDMA/bnxt_re: Use IS_ERR_OR_NULL where appropriate
Constructs such as if (ptr && !IS_ERR(ptr)) can be shorted to
just !IS_ERR_OR_NULL(ptr) instead. Make substitutions in the bnxt_re
driver where appropriate.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Colin Ian King [Fri, 17 Feb 2017 15:35:22 +0000 (15:35 +0000)]
RDMA/bnxt_re: remove redundant initialization of rc to zero
rc is initialized to zero but is then updated by calls to
bnxt_qplib_free_fast_reg_page_list and/or bnxt_qpliob_free_mrw
so the initialization is redundant and can be removed.
Detected with CoverityScan, CID#
1408448 ("Unused Value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Noa Osherovich [Thu, 20 Apr 2017 17:53:33 +0000 (20:53 +0300)]
IB/mlx5: Add support for active_width and active_speed in RoCE
Add missing calculation and translation of active_width and
active_speed for RoCE.
Fixes:
3f89a643eb295 ('IB/mlx5: Extend query_device/port to ...')
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Noa Osherovich [Thu, 20 Apr 2017 17:53:32 +0000 (20:53 +0300)]
IB/mlx5: Set mlx5_query_roce_port's return value to void
In case of an error, the properties reported to user
are zeroed out, so no need for a return value.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Noa Osherovich [Thu, 20 Apr 2017 17:53:31 +0000 (20:53 +0300)]
IB/core: Add HDR speed enum
Add high data rate speed to the ib_port_speed enumeration.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Moni Shoua [Thu, 20 Apr 2017 10:26:54 +0000 (13:26 +0300)]
IB/mlx5: Set correct SL in completion for RoCE
There is a difference when parsing a completion entry between Ethernet
and IB ports. When link layer is Ethernet the bits describe the type of
L3 header in the packet. In the case when link layer is Ethernet and VLAN
header is present the value of SL is equal to the 3 UP bits in the VLAN
header. If VLAN header is not present then the SL is undefined and consumer
of the completion should check if IB_WC_WITH_VLAN is set.
While that, this patch also fills the vlan_id field in the completion if
present.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Moni Shoua [Sun, 16 Apr 2017 04:31:34 +0000 (07:31 +0300)]
IB/cma: Send MRA for reply messages
Current implementation of RDMA_CM sends MRA (Message Receipt
Acknowledgment) only for request messages but not for response messages.
As a result, a slow active side of the connection may send a ready-to-use
message to the passive side in a delay that is too long for the passive
side to wait for.
This patch adds a call to ib_send_cm_mra() upon receiving a response
message and by this tells the other side to modify the service timeout
to a bigger value, 16 times than before. As in the request case, MRA
for reply will be sent only if a duplicate response has arrived.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Matan Barak <matan@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Parav Pandit [Sun, 16 Apr 2017 04:29:29 +0000 (07:29 +0300)]
IB/mlx5: Support congestion related counters
This patch adds support to query the congestion related hardware counters
through new command and links them with other hw counters being available
in hw_counters sysfs location.
In order to reuse existing infrastructure it renames related q_counter
data structures to more generic counters to reflect q_counters and
congestion counters and maybe some other counters in the future.
New hardware counters:
* rp_cnp_handled - CNP packets handled by the reaction point
* rp_cnp_ignored - CNP packets ignored by the reaction point
* np_cnp_sent - CNP packets sent by notification point to respond to
CE marked RoCE packets
* np_ecn_marked_roce_packets - CE marked RoCE packets received by
notification point
It also avoids returning ENOSYS which is specific for invalid
system call and produces the following checkpatch.pl warning.
WARNING: ENOSYS means 'invalid syscall nr' and nothing else
+ return -ENOSYS;
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Sat, 15 Apr 2017 15:47:25 +0000 (18:47 +0300)]
IB/mthca: Check validity of output parameter pointer
The mthca driver didn't check supplied pointer to functions
mthca_cmd_poll() and mthca_cmd_wait(). This caused to the following
smatch errors:
drivers/infiniband/hw/mthca/mthca_cmd.c:371 mthca_cmd_poll() error: we previously assumed 'out_param' could be null (see line 353)
drivers/infiniband/hw/mthca/mthca_cmd.c:454 mthca_cmd_wait() error: we previously assumed 'out_param' could be null (see line 432)
In reality all callers of these functions are setting out_is_imm
flag are providing pointer too. However it is better to check
again to remove smatch errors to achieve warning free subsystem.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Slava Shwartsman [Mon, 3 Apr 2017 10:13:52 +0000 (13:13 +0300)]
IB/mlx5: Add drop flow steering rule support
A drop rule is described by an action drop and no destination.
If a user specified IB_FLOW_SPEC_ACTION_DROP then set the action
to MLX5_FLOW_CONTEXT_ACTION_DROP and clear the destination.
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Slava Shwartsman [Mon, 3 Apr 2017 10:13:51 +0000 (13:13 +0300)]
IB/core: Introduce drop flow specification
This flow steering specification identifies flow for drop by the HW.
If user create a flow only with the drop specification,
then all the packets that hit this flow will be dropped, otherwise the HW
will drop only the packets that match the other L2/L3/L4 specifications.
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Ariel Levkovich [Mon, 3 Apr 2017 10:11:03 +0000 (13:11 +0300)]
IB/mlx5: Use IP version matching to classify IP traffic
This change adds the ability for flow steering to classify IPv4/6
packets with MPLS tag (Ethertype 0x8847 and 0x8848) as standard IP
packets and hit IPv4/6 classifed steering rules.
When user added a flow rule with IP classification, driver was
implicitly adding ethertype matching to the created rule in order
to distinguish between IPv4 and IPv6 protocols.
Since IP packets with MPLS tag header have MPLS ethertype, they missed
the rule and ended up hitting the default filters.
Such behavior prevented from MPLS packets to undergo inbound traffic
load balancing flows (if such were defined by configuring RSS) to
achieve higher throughput - the way that non-MPLS IP packets performed.
Since our device is able to look past the MPLS tag and identify the
next protocol we introduce this solution which replaces Ethertype
matching by the device's capability to perform IP version parsing
and matching in order to distinguish between IPv4 and IPv6.
Therefore, whenever a flow with IP spec is added and device support IP
version matching, driver will implicitly add IP version matching to the
rule (Based on the IP spec type) without Ethertype matching which will
cause relevant MPLS tagged packets to hit this rule as well.
Otherwise (device doesn't support IP version matching), we fall back to
setting Ethertype matching.
If the user's filters specify an L2 ethertype and an IP spec
the rule will then match both the ethertype and the IP version.
The device's support for IP version matching is reported by the
device via dedicated capability bit in query_device_cap and named
outer/inner_ip_version.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Ariel Levkovich [Mon, 3 Apr 2017 10:11:02 +0000 (13:11 +0300)]
IB/mlx5: Add inner spec and IPv6 validation in user's flow attribute list
This change fixes an incomplete validation of the user's
flow attributes list.
Previous implementation validated only matching of IPv4 Ethertype
to IPv4 spec of outer headers (in case both Ethernet with specified
Ethertype and IP specs were present) and lacked the validation of:
1. Matching of IPv6 Ethertype in Ethernet spec (if such exists) to an
IPv6 protocol spec (if such exists).
2. Validation of Ethertype to IP protocol matching on inner headers specs.
Which could cause some combinations of unmatching Ethernet and IP
protocols to pass validation and apply on the device.
The fix adds validation of IPv6 Ethertype and IP spec as well as
performing the scan on both outer and inner attributes.
Fixes:
038d2ef87572 ("Add flow steering support")
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Bodong Wang [Wed, 29 Mar 2017 03:12:14 +0000 (06:12 +0300)]
IB/mlx5: Fix wrong use of kfree at bad flow in create_cq_user
The kfree was called to free cqb, while it should free *cqb.
Fixes:
1cbe6fc86ccf ("IB/mlx5: Add support for CQE compressing")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 29 Mar 2017 03:09:01 +0000 (06:09 +0300)]
IB/mlx5: Enlarge autogroup flow table
In order to enlarge the flow group size to 8k, we decrease
the number of flow group types to 6 and increase the flow
table size to 64k.
Flow group size is calculated as follow:
group_size = table_size / (#group_types + 1)
Fixes:
038d2ef87572 ('IB/mlx5: Add flow steering support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 29 Mar 2017 03:09:00 +0000 (06:09 +0300)]
IB/mlx5: Check supported flow table size
Check that the required flow table size is supported
by device. Return ENOMEM error if no space left.
In addition change the create flow table routine
to return ENOMEM instead of ENOSPC.
Fixes:
038d2ef87572 ('IB/mlx5: Add flow steering support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 29 Mar 2017 03:03:03 +0000 (06:03 +0300)]
IB/mlx5: Change vma from shared to private
Anonymous VMA (->vm_ops == NULL) cannot be shared, otherwise
it would lead to SIGBUS.
Remove the shared flags from the vma after we change it to be
anonymous.
This is easily reproduced by doing modprobe -r while running a
user-space application such as raw_ethernet_bw.
Fixes:
7c2344c3bbf97 ('IB/mlx5: Implements disassociate_ucontext API')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 29 Mar 2017 03:03:02 +0000 (06:03 +0300)]
IB/mlx5: Take write semaphore when changing the vma struct
When the driver disassociate user context, it changes the vma to
anonymous by setting the vm_ops to null and zap the vma ptes.
In order to avoid race in the kernel, we need to take write lock
before we change the vma entries.
Fixes:
7c2344c3bbf97 ('IB/mlx5: Implements disassociate_ucontext API')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 29 Mar 2017 03:03:01 +0000 (06:03 +0300)]
IB/mlx4: Change vma from shared to private
Anonymous VMA (->vm_ops == NULL) cannot be shared, otherwise
it would lead to SIGBUS.
Remove the shared flags from the vma after we change it to be
anonymous.
This is easily reproduced by doing modprobe -r while running a
user-space application such as raw_ethernet_bw.
Fixes:
ae184ddeca5db ('IB/mlx4_ib: Disassociate support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 29 Mar 2017 03:03:00 +0000 (06:03 +0300)]
IB/mlx4: Take write semaphore when changing the vma struct
When the driver disassociate user context, it changes the vma to
anonymous by setting the vm_ops to null and zap the vma ptes.
In order to avoid race in the kernel, we need to take write lock
before we change the vma entries.
Fixes:
ae184ddeca5db ('IB/mlx4_ib: Disassociate support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Jack Morgenstein [Tue, 21 Mar 2017 10:57:06 +0000 (12:57 +0200)]
IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
A warning message during SRIOV multicast cleanup should have actually been
a debug level message. The condition generating the warning does no harm
and can fill the message log.
In some cases, during testing, some tests were so intense as to swamp the
message log with these warning messages, causing a stall in the console
message log output task. This stall caused an NMI to be sent to all CPUs
(so that they all dumped their stacks into the message log).
Aside from the message flood causing an NMI, the tests all passed.
Once the message flood which caused the NMI is removed (by reducing the
warning message to debug level), the NMI no longer occurs.
Sample message log (console log) output illustrating the flood and
resultant NMI (snippets with comments and modified with ... instead
of hex digits, to satisfy checkpatch.pl):
<mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
*** About 4000 almost identical lines in less than one second ***
<mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
INFO: rcu_sched detected stalls on CPUs/tasks: { 17} (...)
*** { 17} above indicates that CPU 17 was the one that stalled ***
sending NMI to all CPUs:
...
NMI backtrace for cpu 17
CPU: 17 PID: 45909 Comm: kworker/17:2
Hardware name: HP ProLiant DL360p Gen8, BIOS P71 09/08/2013
Workqueue: events fb_flashcursor
task:
ffff880478...... ti:
ffff88064e...... task.ti:
ffff88064e......
RIP: 0010:[
ffffffff81......] [
ffffffff81......] io_serial_in+0x15/0x20
RSP: 0018:
ffff88064e257cb0 EFLAGS:
00000002
RAX:
0000000000...... RBX:
ffffffff81...... RCX:
0000000000......
RDX:
0000000000...... RSI:
0000000000...... RDI:
ffffffff81......
RBP:
ffff88064e...... R08:
ffffffff81...... R09:
0000000000......
R10:
0000000000...... R11:
ffff88064e...... R12:
0000000000......
R13:
0000000000...... R14:
ffffffff81...... R15:
0000000000......
FS:
0000000000......(0000) GS:
ffff8804af......(0000) knlGS:
000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080......
CR2:
00007f2a2f...... CR3:
0000000001...... CR4:
0000000000......
DR0:
0000000000...... DR1:
0000000000...... DR2:
0000000000......
DR3:
0000000000...... DR6:
00000000ff...... DR7:
0000000000......
Stack:
ffff88064e......
ffffffff81......
ffffffff81......
0000000000......
ffffffff81......
ffff88064e......
ffffffff81......
ffffffff81......
ffffffff81......
ffff88064e......
ffffffff81......
0000000000......
Call Trace:
[<
ffffffff813d099b>] wait_for_xmitr+0x3b/0xa0
[<
ffffffff813d0b5c>] serial8250_console_putchar+0x1c/0x30
[<
ffffffff813d0b40>] ? serial8250_console_write+0x140/0x140
[<
ffffffff813cb5fa>] uart_console_write+0x3a/0x80
[<
ffffffff813d0aae>] serial8250_console_write+0xae/0x140
[<
ffffffff8107c4d1>] call_console_drivers.constprop.15+0x91/0xf0
[<
ffffffff8107d6cf>] console_unlock+0x3bf/0x400
[<
ffffffff813503cd>] fb_flashcursor+0x5d/0x140
[<
ffffffff81355c30>] ? bit_clear+0x120/0x120
[<
ffffffff8109d5fb>] process_one_work+0x17b/0x470
[<
ffffffff8109e3cb>] worker_thread+0x11b/0x400
[<
ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[<
ffffffff810a5aef>] kthread+0xcf/0xe0
[<
ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[<
ffffffff81645858>] ret_from_fork+0x58/0x90
[<
ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Code: 48 89 e5 d3 e6 48 63 f6 48 03 77 10 8b 06 5d c3 66 0f 1f 44 00 00 66 66 66 6
As indicated in the stack trace above, the console output task got swamped.
Fixes:
b9c5d6a64358 ("IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV")
Cc: <stable@vger.kernel.org> # v3.6+
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Jack Morgenstein [Tue, 21 Mar 2017 10:57:05 +0000 (12:57 +0200)]
IB/mlx4: Fix ib device initialization error flow
In mlx4_ib_add, procedure mlx4_ib_alloc_eqs is called to allocate EQs.
However, in the mlx4_ib_add error flow, procedure mlx4_ib_free_eqs is not
called to free the allocated EQs.
Fixes:
e605b743f33d ("IB/mlx4: Increase the number of vectors (EQs) available for ULPs")
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Majd Dibbiny [Sun, 19 Mar 2017 09:01:28 +0000 (11:01 +0200)]
IB/mlx4: Support RAW Ethernet when RoCE is disabled
On some environments, such as certain SR-IOV VF configurations, RoCE
isn't supported for mlx4 Ethernet ports. Currently the driver will
not open IB device on that port.
This is problematic since we do want user-space RAW Ethernet QPs functionality
to remain in place. For that end, enhance the relevant driver flows such that we
do create a device instance in that case.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Jack Morgenstein [Sun, 19 Mar 2017 08:55:57 +0000 (10:55 +0200)]
IB/core: Fix sysfs registration error flow
The kernel commit cited below restructured ib device management
so that the device kobject is initialized in ib_alloc_device.
As part of the restructuring, the kobject is now initialized in
procedure ib_alloc_device, and is later added to the device hierarchy
in the ib_register_device call stack, in procedure
ib_device_register_sysfs (which calls device_add).
However, in the ib_device_register_sysfs error flow, if an error
occurs following the call to device_add, the cleanup procedure
device_unregister is called. This call results in the device object
being deleted -- which results in various use-after-free crashes.
The correct cleanup call is device_del -- which undoes device_add
without deleting the device object.
The device object will then (correctly) be deleted in the
ib_register_device caller's error cleanup flow, when the caller invokes
ib_dealloc_device.
Fixes:
55aeed06544f6 ("IB/core: Make ib_alloc_device init the kobject")
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Parav Pandit [Sun, 19 Mar 2017 08:55:55 +0000 (10:55 +0200)]
IB/core: Fix kernel crash during fail to initialize device
This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().
This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().
This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.
[81416.561946] BUG: unable to handle kernel NULL pointer dereference at (null)
[81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core]
[81416.576222] PGD
78da66067
[81416.576223] PUD
7f2d7c067
[81416.579484] PMD 0
[81416.582720]
[81416.587242] Oops: 0000 [#1] SMP
[81416.722395] task:
ffff8807887515c0 task.stack:
ffffc900062c0000
[81416.729148] RIP: 0010:ib_cache_release_one+0x29/0x80 [ib_core]
[81416.735793] RSP: 0018:
ffffc900062c3a90 EFLAGS:
00010202
[81416.741823] RAX:
0000000000000000 RBX:
0000000000000001 RCX:
0000000000000000
[81416.749785] RDX:
0000000000000000 RSI:
0000000000000282 RDI:
ffff880859fec000
[81416.757757] RBP:
ffffc900062c3aa0 R08:
ffff8808536e5ac0 R09:
ffff880859fec5b0
[81416.765708] R10:
00000000536e5c01 R11:
ffff8808536e5ac0 R12:
ffff880859fec000
[81416.773672] R13:
0000000000000000 R14:
ffff8808536e5ac0 R15:
ffff88084ebc0060
[81416.781621] FS:
00007fd879fab740(0000) GS:
ffff88085fac0000(0000) knlGS:
0000000000000000
[81416.790522] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[81416.797094] CR2:
0000000000000000 CR3:
00000007eb215000 CR4:
00000000003406e0
[81416.805051] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[81416.812997] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[81416.820950] Call Trace:
[81416.824226] ib_device_release+0x1e/0x40 [ib_core]
[81416.829858] device_release+0x32/0xa0
[81416.834370] kobject_cleanup+0x63/0x170
[81416.839058] kobject_put+0x25/0x50
[81416.843319] ib_dealloc_device+0x25/0x40 [ib_core]
[81416.848986] mlx5_ib_add+0x163/0x1990 [mlx5_ib]
[81416.854414] mlx5_add_device+0x5a/0x160 [mlx5_core]
[81416.860191] mlx5_register_interface+0x8d/0xc0 [mlx5_core]
[81416.866587] ? 0xffffffffa09e9000
[81416.870816] mlx5_ib_init+0x15/0x17 [mlx5_ib]
[81416.876094] do_one_initcall+0x51/0x1b0
[81416.880861] ? __vunmap+0x85/0xd0
[81416.885113] ? kmem_cache_alloc_trace+0x14b/0x1b0
[81416.890768] ? vfree+0x2e/0x70
[81416.894762] do_init_module+0x60/0x1fa
[81416.899441] load_module+0x15f6/0x1af0
[81416.904114] ? __symbol_put+0x60/0x60
[81416.908709] ? ima_post_read_file+0x3d/0x80
[81416.913828] ? security_kernel_post_read_file+0x6b/0x80
[81416.920006] SYSC_finit_module+0xa6/0xf0
[81416.924888] SyS_finit_module+0xe/0x10
[81416.929568] entry_SYSCALL_64_fastpath+0x1a/0xa9
[81416.935089] RIP: 0033:0x7fd879494949
[81416.939543] RSP: 002b:
00007ffdbc1b4e58 EFLAGS:
00000202 ORIG_RAX:
0000000000000139
[81416.947982] RAX:
ffffffffffffffda RBX:
0000000001b66f00 RCX:
00007fd879494949
[81416.955965] RDX:
0000000000000000 RSI:
000000000041a13c RDI:
0000000000000003
[81416.963926] RBP:
0000000000000003 R08:
0000000000000000 R09:
0000000001b652a0
[81416.971861] R10:
0000000000000003 R11:
0000000000000202 R12:
00007ffdbc1b3e70
[81416.979763] R13:
00007ffdbc1b3e50 R14:
0000000000000005 R15:
0000000000000000
[81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP:
ffffc900062c3a90
[81417.016045] CR2:
0000000000000000
Fixes:
55aeed0654 ("IB/core: Make ib_alloc_device init the kobject")
Fixes:
7738613e7c ("IB/core: Add per port immutable struct to ib_device")
Cc: <stable@vger.kernel.org> # v4.2+
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Feras Daoud [Sun, 19 Mar 2017 09:18:55 +0000 (11:18 +0200)]
IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow
Before calling ipoib_stop, rtnl_lock should be taken, then
the flow clears the IPOIB_FLAG_ADMIN_UP and IPOIB_FLAG_OPER_UP
flags, and waits for mcast completion if IPOIB_MCAST_FLAG_BUSY
is set.
On the other hand, the flow of multicast join task initializes
a mcast completion, sets the IPOIB_MCAST_FLAG_BUSY and calls
ipoib_mcast_join. If IPOIB_FLAG_OPER_UP flag is not set, this
call returns EINVAL without setting the mcast completion and
leads to a deadlock.
ipoib_stop |
| |
clear_bit(IPOIB_FLAG_ADMIN_UP) |
| |
Context Switch |
| ipoib_mcast_join_task
| |
| spin_lock_irq(lock)
| |
| init_completion(mcast)
| |
| set_bit(IPOIB_MCAST_FLAG_BUSY)
| |
| Context Switch
| |
clear_bit(IPOIB_FLAG_OPER_UP) |
| |
spin_lock_irqsave(lock) |
| |
Context Switch |
| ipoib_mcast_join
| return (-EINVAL)
| |
| spin_unlock_irq(lock)
| |
| Context Switch
| |
ipoib_mcast_dev_flush |
wait_for_completion(mcast) |
ipoib_stop will wait for mcast completion for ever, and will
not release the rtnl_lock. As a result panic occurs with the
following trace:
[13441.639268] Call Trace:
[13441.640150] [<
ffffffff8168b579>] schedule+0x29/0x70
[13441.641038] [<
ffffffff81688fc9>] schedule_timeout+0x239/0x2d0
[13441.641914] [<
ffffffff810bc017>] ? complete+0x47/0x50
[13441.642765] [<
ffffffff810a690d>] ? flush_workqueue_prep_pwqs+0x16d/0x200
[13441.643580] [<
ffffffff8168b956>] wait_for_completion+0x116/0x170
[13441.644434] [<
ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
[13441.645293] [<
ffffffffa05af170>] ipoib_mcast_dev_flush+0x150/0x190 [ib_ipoib]
[13441.646159] [<
ffffffffa05ac967>] ipoib_ib_dev_down+0x37/0x60 [ib_ipoib]
[13441.647013] [<
ffffffffa05a4805>] ipoib_stop+0x75/0x150 [ib_ipoib]
Fixes:
08bc327629cb ("IB/ipoib: fix for rare multicast join race condition")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Feras Daoud [Sun, 19 Mar 2017 09:18:54 +0000 (11:18 +0200)]
IB/ipoib: Update broadcast object if PKey value was changed in index 0
Update the broadcast address in the priv->broadcast object when the
Pkey value changes in index 0, otherwise the multicast GID value will
keep the previous value of the PKey, and will not be updated.
This leads to interface state down because the interface will keep the
old PKey value.
For example, in SR-IOV environment, if the PF changes the value of PKey
index 0 for one of the VFs, then the VF receives PKey change event that
triggers heavy flush. This flush calls update_parent_pkey that update the
broadcast object and its relevant members. If in this case the multicast
GID will not be updated, the interface state will be down.
Fixes:
c2904141696e ("IPoIB: Fix pkey change flow for virtualization environments")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
yonatanc [Thu, 20 Apr 2017 17:55:56 +0000 (20:55 +0300)]
IB/rxe: Cache dst in QP instead of getting it for each send
In RC QP there is no need to resolve the outgoing interface
for each packet, as this does not change during QP life cycle.
Instead cache the interface on the socket and use that one.
This improves performance by 12% by sparing redundant
calls to rxe_find_route.
ib_send_bw -d rxe0 -x 1 -n 9000 -e -s $((1024 * 1024 )) -l 100
----------------------------------------------------------------------------------------
| | bytes | iterations | BW peak[MB/sec] | BW average[MB/sec] | MsgRate[Mpps] |
----------------------------------------------------------------------------------------
| before |
1048576 | 9000 | inf | 551.21 | 0.000551 |
| after |
1048576 | 9000 | inf | 615.54 | 0.000616 |
----------------------------------------------------------------------------------------
Fixes:
8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
yonatanc [Thu, 20 Apr 2017 17:55:55 +0000 (20:55 +0300)]
IB/rxe: Offload CRC calculation when possible
Use CPU ability to perform CRC calculations, by
replacing direct calls to crc32_le() with crypto_shash_updata().
The overall performance gain measured with ib_send_bw tool is 10% and it
was tested on "Intel CPU ES-2660 v2 @ 2.20Ghz" CPU.
ib_send_bw -d rxe0 -x 1 -n 9000 -e -s $((1024 * 1024 )) -l 100
---------------------------------------------------------------------------------------------
| | bytes | iterations | BW peak[MB/sec] | BW average[MB/sec] | MsgRate[Mpps] |
---------------------------------------------------------------------------------------------
| crc32_le |
1048576 | 9000 | inf | 497.60 | 0.000498 |
| CRC offload |
1048576 | 9000 | inf | 546.70 | 0.000547 |
---------------------------------------------------------------------------------------------
Fixes:
8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Parav Pandit [Sun, 19 Mar 2017 09:20:57 +0000 (11:20 +0200)]
IB/rxe: Do not export module's private function
Function rxe_rcv is used internally in RXE and don't need to be
exported. This patch removes such export declaration.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Parav Pandit [Sun, 19 Mar 2017 09:20:56 +0000 (11:20 +0200)]
IB/rxe: Avoid accessing timers for non RC QPs
This patch avoids RNR NAK timer and retransmit timer initialization and
cleanup for non RC QPs (such as UD QP, GSI QP).
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Yonatan Cohen [Fri, 10 Mar 2017 16:23:56 +0000 (18:23 +0200)]
IB/rxe: Add port protocol stats
Expose new counters using the get_hw_stats callback.
We expose the following counters:
+---------------------+----------------------------------------+
| Name | Description |
|---------------------+----------------------------------------|
|sent_pkts | number of sent pkts |
|---------------------+----------------------------------------|
|rcvd_pkts | number of received packets |
|---------------------+----------------------------------------|
|out_of_sequence | number of errors due to packet |
| | transport sequence number |
|---------------------+----------------------------------------|
|duplicate_request | number of received duplicated packets. |
| | A request that previously executed is |
| | named duplicated. |
|---------------------+----------------------------------------|
|rcvd_rnr_err | number of received RNR by completer |
|---------------------+----------------------------------------|
|send_rnr_err | number of sent RNR by responder |
|---------------------+----------------------------------------|
|rcvd_seq_err | number of out of sequence packets |
| | received |
|---------------------+----------------------------------------|
|ack_deffered | number of deferred handling of ack |
| | packets. |
|---------------------+----------------------------------------|
|retry_exceeded_err | number of times retry exceeded |
|---------------------+----------------------------------------|
|completer_retry_err | number of times completer decided to |
| | retry |
|---------------------+----------------------------------------|
|send_err | number of failed send packet |
+---------------------+----------------------------------------+
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Fri, 21 Apr 2017 02:18:54 +0000 (22:18 -0400)]
cxgb4: Convert PDBG to pr_debug the second
A couple spots were missed in the original patch to implement this
change. Add those spots.
Fixes:
a9a42886d0b3 (cxgb4: Convert PDBG to pr_debug)
Signed-off-by: Doug Ledford <dledford@redhat.com>
Markus Elfring [Thu, 16 Feb 2017 08:30:55 +0000 (09:30 +0100)]
IB/hns: Use kcalloc() in hns_roce_buddy_init()
* Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus use the corresponding function "kcalloc".
This issue was detected by using the Coccinelle software.
* Replace the specification of data types by pointer dereferences
to make the corresponding size determinations a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Markus Elfring [Thu, 16 Feb 2017 07:50:11 +0000 (08:50 +0100)]
IB/hns: Use kmalloc_array() in hns_roce_cmd_use_events()
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data structure by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Markus Elfring [Fri, 10 Feb 2017 20:45:38 +0000 (21:45 +0100)]
IB/hfi1: Coding style improvement (make sizeof use safer)
Replace the specification of a data structure by a reference to
the desired member as the parameter for the operator "sizeof" to make
the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Markus Elfring [Fri, 10 Feb 2017 07:50:45 +0000 (08:50 +0100)]
IB/hfi1: Remove intermediate var in hfi1_user_sdma_alloc_queues()
* Pass a product for a call of the function "vmalloc_user" without storing
it in an intermediate variable.
* Delete the local variable "memsize" which became unnecessary with
this refactoring.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Markus Elfring [Thu, 9 Feb 2017 15:06:12 +0000 (16:06 +0100)]
IB/hfi1: Use kcalloc() in hfi1_user_sdma_alloc_queues()
* Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus reuse the corresponding function "kcalloc".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data type by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Markus Elfring [Thu, 9 Feb 2017 14:30:53 +0000 (15:30 +0100)]
IB/hfi1: Use kcalloc() in hfi1_user_exp_rcv_init()
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus reuse the corresponding function "kcalloc".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data type by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Joe Perches [Thu, 9 Feb 2017 22:23:51 +0000 (14:23 -0800)]
cxgb4: Convert PDBG to pr_debug
Use a more typical logging style.
Miscellanea:
o Obsolete the c4iw_debug module parameter
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Joe Perches [Thu, 9 Feb 2017 22:23:50 +0000 (14:23 -0800)]
cxgb4: Use more common logging style
Convert printks to pr_<level>
Miscellanea:
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Joe Perches [Thu, 9 Feb 2017 22:23:49 +0000 (14:23 -0800)]
cxgb3: Convert PDBG to pr_debug
Using the normal mechanism, not an indirected one, is clearer.
Miscellanea:
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Joe Perches [Thu, 9 Feb 2017 22:23:48 +0000 (14:23 -0800)]
cxgb3: Use more common logging style
Convert printks to pr_<level>
Miscellanea:
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Mon, 10 Apr 2017 08:22:30 +0000 (11:22 +0300)]
IB/IPoIB: Support acceleration options callbacks
IPoIB driver now uses the new set of callback functions.
If the hardware provider supports the new ipoib_options implementation,
the driver uses the callbacks in its data path flows, otherwise it uses the
driver default implementation for all data flows in its code.
The default implementation wasn't change and it is exactly as it was before
introduction of acceleration support.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Mon, 10 Apr 2017 08:22:29 +0000 (11:22 +0300)]
IB/IPoIB: Use defined function for netdev_priv function
Make ipoib_priv point to netdev_priv where the code calls netdev_priv.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Mon, 10 Apr 2017 08:22:28 +0000 (11:22 +0300)]
IB/IPoIB: Rename qpn to be dqpn in ipoib_send and post_send functions
Change of function parameter name from qpn to be dqpn.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Mon, 10 Apr 2017 08:22:27 +0000 (11:22 +0300)]
IB/IPoIB: Separate control from HW operation on ipoib_open/stop ndo
This patch is preparing the netdev part at the IPoIB driver to be able
to use the ipoib_options.
It deals with the two flows from the .ndo: ipoib_open and ipoib_stop.
The code is rearranged as follows:
* All operations which deal with the hardware resources, (for example
change QP state, post-receive etc.) are performed in one place.
* All operations that are control oriented (like restart multicast task,
start the reap_ah etc.) are performed in separate place.
The functions that deal with the hardware resources now located at
__ipoib_ib_dev_open for the ipoib_open flow and __ipoib_ib_dev_stop
for ipoib_stop.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Mon, 10 Apr 2017 08:22:26 +0000 (11:22 +0300)]
IB/IPoIB: Separate control and data related initializations
This patch prepares init and teardown flows so we can call them
through ipoib_options function pointers.
It arranges that area of code as the following:
* All operations which deal with the resource allocation/deletion
are performed in one place.
* All operations that are control oriented, meaning that they are not
connected to a specific hardware, are performed in a separate place.
The operations for allocation of hardware resources are now in the
function ipoib_dev_init_default, and the deletion of all the resources
are in ipoib_dev_uninit_default
The only exception is the creation of the PD object,
which is used both for resource allocation (create QP etc.)
and for control flows like creating AH.
It also does:
* Move creation of rx_ring and tx_ring to be in the resources
allocation area.
* Move the function ipoib_ib_dev_open that does the open device
to the control area instead of the dev_init which creates resources.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Niranjana Vishwanathapura [Mon, 10 Apr 2017 08:22:25 +0000 (11:22 +0300)]
IB/IPoIB: Introduce RDMA netdev interface and IPoIB structs
Add RDMA netdev interface to ib device structure allowing RDMA
netdev devices to be allocated by ib clients.
The idea is to allow to providers to optimize IPoIB data path.
New struct that includes functions and data member is exposed.
It exposes set of callback functions for handling data path flows
in IPoIB driver.
Each provider can support these set of functions in order
to optimize its specific data path, and let IPoIB to leverage
its data path.
There is an assumption, that providers should give the full set
of functions and not only part of them, in order to work properly.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:30 +0000 (20:29 -0700)]
IB/hfi1: VNIC SDMA support
HFI1 VNIC SDMA support enables transmission of VNIC packets over SDMA.
Map VNIC queues to SDMA engines and support halting and wakeup of the
VNIC queues.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:29 +0000 (20:29 -0700)]
IB/hfi1: Virtual Network Interface Controller (VNIC) HW support
HFI1 HW specific support for VNIC functionality.
Dynamically allocate a set of contexts for VNIC when the first vnic
port is instantiated. Allocate VNIC contexts from user contexts pool
and return them back to the same pool while freeing up. Set aside
enough MSI-X interrupts for VNIC contexts and assign them when the
contexts are allocated. On the receive side, use an RSM rule to
spread TCP/UDP streams among VNIC contexts.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:28 +0000 (20:29 -0700)]
IB/hfi1: OPA_VNIC RDMA netdev support
Add support to create and free OPA_VNIC rdma netdev devices.
Implement netstack interface functionality including xmit_skb,
receive side NAPI etc. Also implement rdma netdev control functions.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:27 +0000 (20:29 -0700)]
IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) function
OPA VEMA function interfaces with the Infiniband MAD stack to exchange the
management information packets with the Ethernet Manager (EM).
It interfaces with the OPA VNIC netdev function to SET/GET the management
information. The information exchanged with the EM includes class port
details, encapsulation configuration, various counters, unicast and
multicast MAC list and the MAC table. It also supports sending traps
to the EM.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Sudeep Dutt <sudeep.dutt@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:26 +0000 (20:29 -0700)]
IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) interface
OPA VNIC EMA interface functions are the management interfaces to the OPA
VNIC netdev. Add support to add and remove VNIC ports. Implement the
required GET/SET management interface functions and processing of new
management information. Add support to send trap notifications upon various
events like interface status change, unicast/multicast mac list update and
mac address change.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:25 +0000 (20:29 -0700)]
IB/opa-vnic: VNIC MAC table support
OPA VNIC MAC table contains the MAC address to DLID mappings provided by
the Ethernet manager. During transmission, the MAC table provides the MAC
address to DLID translation. Implement MAC table using simple hash list.
Also provide support to update/query the MAC table by Ethernet manager.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:24 +0000 (20:29 -0700)]
IB/opa-vnic: VNIC statistics support
OPA VNIC driver statistics support maintains various counters including
standard netdev counters and the Ethernet manager defined counters.
Add the Ethtool hook to read the counters.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:23 +0000 (20:29 -0700)]
IB/opa-vnic: VNIC Ethernet Management (EM) structure definitions
Define VNIC EM MAD structures and the associated macros. These structures
are used for information exchange between VNIC EM agent (EMA) on the host
and the Ethernet manager. These include the virtual ethernet switch (vesw)
port information, vesw port mac table, summay and error counters,
vesw port interface mac lists and the EMA trap.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:22 +0000 (20:29 -0700)]
IB/opa-vnic: Virtual Network Interface Controller (VNIC) netdev
OPA VNIC netdev function supports Ethernet functionality over Omni-Path
fabric by encapsulating Ethernet packets inside Omni-Path packet header.
It allocates a rdma netdev device and interfaces with the network stack to
provide standard Ethernet network interfaces. It overrides HFI1 device's
netdev operations where it is required.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Sudeep Dutt <sudeep.dutt@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:21 +0000 (20:29 -0700)]
IB/opa-vnic: Virtual Network Interface Controller (VNIC) interface
Define OPA VNIC interface between hardware independent VNIC
functionality and the hardware dependent VNIC functionality.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:20 +0000 (20:29 -0700)]
IB/opa-vnic: RDMA NETDEV interface
Add rdma netdev interface to ib device structure allowing rdma netdev
devices to be allocated by ib clients.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vishwanathapura, Niranjana [Thu, 13 Apr 2017 03:29:19 +0000 (20:29 -0700)]
IB/opa-vnic: Virtual Network Interface Controller (VNIC) documentation
Add OPA VNIC design document explaining the VNIC architecture and the
driver design.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Thu, 20 Apr 2017 16:00:41 +0000 (12:00 -0400)]
Merge branch 'k.o/for-4.12' into k.o/for-4.12-rdma-netdevice
Matan Barak [Tue, 18 Apr 2017 09:03:42 +0000 (12:03 +0300)]
IB/core: Rename uverbs event file structure
Previously, ib_uverbs_event_file was suffixed by _file as it contained
the actual file information. Since it's now only used as base struct
for ib_uverbs_async_event_file and ib_uverbs_completion_event_file,
we change its name to ib_uverbs_event_queue. This represents its
logical role better.
Fixes:
1e7710f3f656 ('IB/core: Change completion channel to use the reworked objects schema')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:41 +0000 (12:03 +0300)]
IB/core: Don't use is_async in event files to infer events size
Previously, we inferred the events size in ib_uverbs_event_read by
using the is_async flag. Instead of that, we pass the event size
directly.
Fixes:
1e7710f3f656 ('IB/core: Change completion channel to use the reworked objects schema')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:40 +0000 (12:03 +0300)]
IB/core: A small refactor in destroy WQ handler
Instead of having uverbs_uobject_put both in the error flow and the
good flow, we unite them.
Fixes:
fd3c7904db6e ('IB/core: Change idr objects to use the new schema')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:39 +0000 (12:03 +0300)]
IB/core: Nullify ib_uobject during allocation
Currently, we initialize all fields of ib_uobject straight after
allocation. Therefore, a kmalloc was sufficient. Since ib_uobject
could be embedded in a type specific structure, we nullify it to
spare programmer errors.
Fixes:
3832125624b7 ('IB/core: Add support for idr types')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:38 +0000 (12:03 +0300)]
IB/core: Don't pass the lock state to _rdma_remove_commit_uobject
The only scenario where this function was called while the lock is
already taken is in the context cleanup scenario. Thus, in order not
to pass the lock state to this function, we just call the remove logic
straight from the cleanup context function.
Fixes:
3832125624b7 ('IB/core: Add support for idr types')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:37 +0000 (12:03 +0300)]
IB/core: Rename write flag to exclusive in rdma_core
We rename the "write" flags to "exclusive", as it's used for both
WRITE and DESTROY actions.
Fixes:
3832125624b7 ('IB/core: Add support for idr types')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
David S. Miller [Mon, 17 Apr 2017 15:08:33 +0000 (11:08 -0400)]
Merge branch 'mlx5-RDMA-netdevice'
Saeed Mahameed says:
====================
Mellanox, mlx5 RDMA net device support
This series provides the lower level mlx5 support of RDMA netdevice
creation API [1] suggested and introduced by Intel's HFI OPA VNIC
netdevice driver [2], to enable IPoIB mlx5 RDMA netdevice creation.
mlx5 IPoIB RDMA netdev will serve as an acceleration netdevice for the current
IPoIB ULP generic netdevice, providing:
- mlx5 RSS support.
- mlx5 HW RX,TX offloads (checksum, TSO, LRO, etc ..).
- Full mlx5 HW features transparent to the ULP itself.
The idea here is to reuse and benefit from the already implemented mlx5e netdevice
management and channels API for both etherent and RDMA netdevices, since both IPoIB
and Ethernet netdevices share same common mlx5 HW resources (with some small
exceptions) and share most of the control/data path logic, it is more natural to
have them share the same code.
The differences between IPoIB and Ethernet netdevices can be summarized to:
Steering:
In mlx5, IPoIB traffic is sent and received from an underlay special QP, and in Ethernet
the traffic is handled by vports and vport steering is managed by e-switch or FW.
For IPoIB traffic to get steered correctly the only thing we need to do is to create RSS
HW contexts for RX and TX HW contexts for TX (similar to mlx5e) with the underlay QP attached to
them (underlay QP will be 0 in case of Ethernet).
RX,TX:
Since IPoIB traffic is different, slightly modified RX and TX handlers are required,
still we do some code reuse in data path via common helper functions.
All of the other generic netdevice and mlx5 aspects will be shared between mlx5 Ethernet
and IPoIB netdevices, e.g.
- Channels creation and handling (RQs,SQs,CQs, NAPI, interrupt moderation, etc..)
- Offloads, checksum, GRO, LRO, TSO, and more.
- netdevice logic and non Ethernet specific ndos (open/close, etc..)
In order to achieve what we want:
In patchet 1 to 3, Erez added the supported for underlay QP in mlx5_ifc and refactored
the mlx5 steering code to accept the underlay QP as a parameter for creating steering
objects and enabled flow steering for IB link.
Then we are going to use the mlx5e netdevice profile, which is already used to separate between
NIC and VF representors netdevices, to create new type of IPoIB netdevice profile.
For that, one small refactoring is required to make mlx5e netdevice profile management
more genetic and agnostic to link type which is done in patch #4.
In patch #5, we introduce ipoib.c to host all of mlx5 IPoIB (mlx5i) specific logic and a
skeleton for the IPoIB mlx5 netdevice profile, and we will start filling it in next patches,
using mlx5e already existing APIs.
Patch #6 and #7, Implement init/cleanup RX mlx5i netdev profile handlers to create mlx5 RSS
resources, same as mlx5e but without vlan and L2 steering tables.
Patch #8, Implement init/cleanup TX mlx5i netdev profile handlers, to create TX resources
same as mlx5e but with one TC (tc = 0) support.
Patch #9, Implement mlx5i open/close ndos, where we reuese the mlx5e channels API, to start/stop TX/RX channels.
Patch #10, Create the underlay QP and attach it to mlx5i RSS and TX HW contexts.
Patch #11 and #12, Break down the mlx5e xmit flow into smaller helper function and implement the
mlx5i IPoIB xmit routine.
Patch #13 and #14, Have an RX handler per netdevice profile. We already do this before this series
in a non clean way to separate between NIC netdev and VF representor RX handlers, in patch 13 we make
the RX handler generic and bound to a profile and in patch 14 we implement the IPoIB RX handlers.
Patch #15, Small cleanup to avoid e-switch with IPoIB netdev.
In order to enable mlx5 IPoIB, a merge between the IPoIB RDMA netdev offolad support [3]
- which was alread submitted to the rdma mailing list - and this series is required
plus an extra small patch [4] which will connect between both sides and actually enables the offload.
Once both patch-sets are merged into linux we will have to submit the extra small patch [4], to enable
the feature.
Thanks,
Saeed.
[1] https://patchwork.kernel.org/patch/
9676637/
[2] https://lwn.net/Articles/715453/
https://patchwork.kernel.org/patch/
9587815/
[3] https://patchwork.kernel.org/patch/
9672069/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?id=
0141db6a686e32294dee015b7d07706162ba48d8
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Erez Shitrit [Thu, 13 Apr 2017 03:37:06 +0000 (06:37 +0300)]
hw/mlx5: Add New bit to check over QP creation
Add check for bit IB_QP_CREATE_NETIF_QP while creating QP.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 13 Apr 2017 03:37:05 +0000 (06:37 +0300)]
net/mlx5e: E-switch vport manager is valid for ethernet only
Currently the driver support only ethernet eswitch, and we want to
protect downstream IPoIB netdev from trying to access it in IB link.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 13 Apr 2017 03:37:04 +0000 (06:37 +0300)]
net/mlx5e: IPoIB, RX handler
Implement IPoIB RX SKB handler.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>