Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:25 +0000 (16:35 +1000)]
vfio: powerpc/spapr: Register memory and define IOMMU v2
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.
Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.
This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.
The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.
The accounting is done per the user process.
This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.
In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.
This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).
As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:24 +0000 (16:35 +1000)]
powerpc/mmu: Add userspace-to-physical addresses translation cache
We are adding support for DMA memory pre-registration to be used in
conjunction with VFIO. The idea is that the userspace which is going to
run a guest may want to pre-register a user space memory region so
it all gets pinned once and never goes away. Having this done,
a hypervisor will not have to pin/unpin pages on every DMA map/unmap
request. This is going to help with multiple pinning of the same memory.
Another use of it is in-kernel real mode (mmu off) acceleration of
DMA requests where real time translation of guest physical to host
physical addresses is non-trivial and may fail as linux ptes may be
temporarily invalid. Also, having cached host physical addresses
(compared to just pinning at the start and then walking the page table
again on every H_PUT_TCE), we can be sure that the addresses which we put
into TCE table are the ones we already pinned.
This adds a list of memory regions to mm_context_t. Each region consists
of a header and a list of physical addresses. This adds API to:
1. register/unregister memory regions;
2. do final cleanup (which puts all pre-registered pages);
3. do userspace to physical address translation;
4. manage usage counters; multiple registration of the same memory
is allowed (once per container).
This implements 2 counters per registered memory region:
- @mapped: incremented on every DMA mapping; decremented on unmapping;
initialized to 1 when a region is just registered; once it becomes zero,
no more mappings allowe;
- @used: incremented on every "register" ioctl; decremented on
"unregister"; unregistration is allowed for DMA mapped regions unless
it is the very last reference. For the very last reference this checks
that the region is still mapped and returns -EBUSY so the userspace
gets to know that memory is still pinned and unregistration needs to
be retried; @used remains 1.
Host physical addresses are stored in vmalloc'ed array. In order to
access these in the real mode (mmu off), there is a real_vmalloc_addr()
helper. In-kernel acceleration patchset will move it from KVM to MMU code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:23 +0000 (16:35 +1000)]
vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.
This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.
This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.
Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.
No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.
This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:22 +0000 (16:35 +1000)]
powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.
This stores the allocated table size in pnv_pci_ioda2_get_table_size()
so the locked_vm counter can be updated correctly when a table is
being disposed.
This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:21 +0000 (16:35 +1000)]
powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release
The existing code programmed TVT#0 with some address and then
immediately released that memory.
This makes use of pnv_pci_ioda2_unset_window() and
pnv_pci_ioda2_set_bypass() which do correct resource release and
TVT update.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:20 +0000 (16:35 +1000)]
vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.
create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.
set_window() sets the window at specified TVT index + @num on PHB.
unset_window() unsets the window from specified TVT.
This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.
create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.
This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:19 +0000 (16:35 +1000)]
powerpc/powernv: Implement multilevel TCE tables
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM) so the kernel might be unable to
allocate contiguous chunk of physical memory to store the TCE table.
To address this, POWER8 CPU (actually, IODA2) supports multi-level
TCE tables, up to 5 levels which splits the table into a tree of
smaller subtables.
This adds multi-level TCE tables support to
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
helpers.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:18 +0000 (16:35 +1000)]
powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
This is a part of moving DMA window programming to an iommu_ops
callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
a first parameter (not pnv_ioda_pe) as it is going to be used as
a callback for VFIO DDW code.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:17 +0000 (16:35 +1000)]
powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages
This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per one VFIO container.
This moves the code which allocates the actual TCE tables to helpers:
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
These do not allocate/free the iommu_table struct.
This enforces window size to be a power of two.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:16 +0000 (16:35 +1000)]
powerpc/powernv/ioda2: Rework iommu_table creation
This moves iommu_table creation to the beginning to make following changes
easier to review. This starts using table parameters from the iommu_table
struct.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:15 +0000 (16:35 +1000)]
powerpc/iommu/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.
Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.
This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.
This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.
This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().
This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.
This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:14 +0000 (16:35 +1000)]
powerpc/powernv: Implement accessor to TCE entry
This replaces direct accesses to TCE table with a helper which
returns an TCE entry address. This does not make difference now but will
when multi-level TCE tables get introduces.
No change in behavior is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:13 +0000 (16:35 +1000)]
powerpc/powernv/ioda2: Add TCE invalidation for all attached groups
The iommu_table struct keeps a list of IOMMU groups it is used for.
At the moment there is just a single group attached but further
patches will add TCE table sharing. When sharing is enabled, TCE cache
in each PE needs to be invalidated so does the patch.
This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
to enable TCE table sharing on PHBs older than IODA2.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:12 +0000 (16:35 +1000)]
powerpc/powernv/ioda2: Move TCE kill register address to PE
At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
property which contains the TCE kill register address. Writing to
this register invalidates TCE cache on IODA/IODA2 hub.
This moves the register address from iommu_table to pnv_pnb as this
register belongs to PHB and invalidates TCE cache for all tables of
all attached PEs.
This moves the property reading/remapping code to a helper which is
called when DMA is being configured for PE and which does DMA setup
for both IODA1 and IODA2.
This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
invalidates cache for the entire table. It should be called after
every call to opal_pci_map_pe_dma_window(). It was not required before
because there was just a single TCE table and 64bit DMA was handled via
bypass window (which has no table so no cache was used) but this is going
to change with Dynamic DMA windows (DDW).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:11 +0000 (16:35 +1000)]
powerpc/iommu: Fix IOMMU ownership control functions
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().
This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.
This only clears TCE content if there is no page marked busy in it_map.
Clearing must be done outside of the table locks as iommu_clear_tce()
called from iommu_clear_tces_and_put_pages() does this.
In order to use bitmap_empty(), the existing code clears bit#0 which
is set even in an empty table if it is bus-mapped at 0 as
iommu_init_table() reserves page#0 to prevent buggy drivers
from crashing when allocated page is bus-mapped at zero
(which is correct). This restores the bit in the case of failure
to bring the it_map to the state it was in when we called
iommu_take_ownership().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:10 +0000 (16:35 +1000)]
vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.
At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.
The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.
This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.
The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.
As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:09 +0000 (16:35 +1000)]
powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.
This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.
This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.
This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:08 +0000 (16:35 +1000)]
powerpc/spapr: vfio: Replace iommu_table with iommu_table_group
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.
This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;
This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.
For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.
For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.
For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.
This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:07 +0000 (16:35 +1000)]
powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()
The pnv_pci_ioda_tce_invalidate() helper invalidates TCE cache. It is
supposed to be called on IODA1/2 and not called on p5ioc2. It receives
start and end host addresses of TCE table.
IODA2 actually needs PCI addresses to invalidate the cache. Those
can be calculated from host addresses but since we are going
to implement multi-level TCE tables, calculating PCI address from
a host address might get either tricky or ugly as TCE table remains flat
on PCI bus but not in RAM.
This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/
pnt_tce_free and defines IODA1/2-specific callbacks which call generic
ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
using generic callbacks as before.
This changes pnv_pci_ioda2_tce_invalidate() to receives TCE index and
number of pages which are PCI addresses shifted by IOMMU page shift.
No change in behaviour is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:06 +0000 (16:35 +1000)]
powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
This adds a iommu_table_ops struct and puts pointer to it into
the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong to.
This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. This is not a parameter of
iommu_init_table() though as there will be cases when iommu_init_table()
will not be called on TCE tables, for example - VFIO.
This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
redundant prefixes.
This removes tce_xxx_rm handlers from ppc_md but does not add
them to iommu_table_ops as this will be done later if we decide to
support TCE hypercalls in real mode. This removes _vm callbacks as
only virtual mode is supported by now so this also removes @rm parameter.
For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_buildmulti_pSeriesLP. This changes multi callback to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is we still have to support "multitce=off"
boot parameter in disable_multitce() and we do not want to walk through
all IOMMU tables in the system and replace "multi" callbacks with single
ones.
For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
This makes the callbacks for them public. Later patches will extend
callbacks for IODA1/2.
No change in behaviour is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:05 +0000 (16:35 +1000)]
powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
Normally a bitmap from the iommu_table is used to track what TCE entry
is in use. Since we are going to use iommu_table without its locks and
do xchg() instead, it becomes essential not to put bits which are not
implied in the direction flag as the old TCE value (more precisely -
the permission bits) will be used to decide whether to put the page or not.
This adds iommu_direction_to_tce_perm() (its counterpart is there already)
and uses it for powernv's pnv_tce_build().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:04 +0000 (16:35 +1000)]
vfio: powerpc/spapr: Rework groups attaching
This is to make extended ownership and multiple groups support patches
simpler for review.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:03 +0000 (16:35 +1000)]
vfio: powerpc/spapr: Moving pinning/unpinning to helpers
This is a pretty mechanical patch to make next patches simpler.
New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.
As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly return NULL.
This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will after KVM acceleration patchset is
applied.
This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:02 +0000 (16:35 +1000)]
vfio: powerpc/spapr: Disable DMA mappings on disabled container
At the moment DMA map/unmap requests are handled irrespective to
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.
This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:01 +0000 (16:35 +1000)]
vfio: powerpc/spapr: Move locked_vm accounting to helpers
There moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).
This reworks debug messages to show the current value and the limit.
This stores the locked pages number in the container so when unlocking
the iommu table pointer won't be needed. This does not have an effect
now but it will with the multiple tables per container as then we will
allow attaching/detaching groups on fly and we may end up having
a container with no group attached but with the counter incremented.
While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
pid of the current process in pr_warn/pr_debug.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:35:00 +0000 (16:35 +1000)]
vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct
as page size can differ.
This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:59 +0000 (16:34 +1000)]
vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page
This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.
Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.
Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:58 +0000 (16:34 +1000)]
vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
This moves page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it to VFIO IOMMU driver where it belongs
to as the platform code does not deal with page pinning.
This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.
This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.
Besides the last part, the rest of the patch is mechanical.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:57 +0000 (16:34 +1000)]
powerpc/iommu: Always release iommu_table in iommu_free_table()
At the moment iommu_free_table() only releases memory if
the table was initialized for the platform code use, i.e. it had
it_map initialized (which purpose is to track DMA memory space use).
With dynamic DMA windows, we will need to be able to release
iommu_table even if it was used for VFIO in which case it_map is NULL
so does the patch.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:56 +0000 (16:34 +1000)]
powerpc/iommu: Put IOMMU group explicitly
So far an iommu_table lifetime was the same as PE. Dynamic DMA windows
will change this and iommu_free_table() will not always require
the group to be released.
This moves iommu_group_put() out of iommu_free_table().
This adds a iommu_pseries_free_table() helper which does
iommu_group_put() and iommu_free_table(). Later it will be
changed to receive a table_group and we will have to change less
lines then.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:55 +0000 (16:34 +1000)]
powerpc/powernv/ioda: Clean up IOMMU group registration
The existing code has 3 calls to iommu_register_group() and
all 3 branches actually cover all possible cases.
This replaces 3 calls with one and moves the registration earlier;
the latter will make more sense when we add TCE table sharing.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:54 +0000 (16:34 +1000)]
powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group
The set_iommu_table_base_and_group() name suggests that the function
sets table base and add a device to an IOMMU group.
The actual purpose for table base setting is to put some reference
into a device so later iommu_add_device() can get the IOMMU group
reference and the device to the group.
At the moment a group cannot be explicitly passed to iommu_add_device()
as we want it to work from the bus notifier, we can fix it later and
remove confusing calls of set_iommu_table_base().
This replaces set_iommu_table_base_and_group() with a couple of
set_iommu_table_base() + iommu_add_device() which makes reading the code
easier.
This adds few comments why set_iommu_table_base() and iommu_add_device()
are called where they are called.
For IODA1/2, this essentially removes iommu_add_device() call from
the pnv_pci_ioda_dma_dev_setup() as it will always fail at this particular
place:
- for physical PE, the device is already attached by iommu_add_device()
in pnv_pci_ioda_setup_dma_pe();
- for virtual PE, the sysfs entries are not ready to create all symlinks
so actual adding is happening in tce_iommu_bus_notifier.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Fri, 5 Jun 2015 06:34:53 +0000 (16:34 +1000)]
powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group
This relies on the fact that a PCI device always has an IOMMU table
which may not be the case when we get dynamic DMA windows so
let's use more reliable check for IOMMU group here.
As we do not rely on the table presence here, remove the workaround
from pnv_pci_ioda2_set_bypass(); also remove the @add_to_iommu_group
parameter from pnv_ioda_setup_bus_dma().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cyril Bur [Tue, 2 Jun 2015 04:26:09 +0000 (14:26 +1000)]
mtd: powernv: Add powernv flash MTD abstraction driver
Powerpc powernv platforms allow access to certain system flash devices
through a firmwarwe interface. This change adds an mtd driver for these
flash devices.
Minor updates from Jeremy Kerr and Joel Stanley.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Reviewed-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Tue, 14 Apr 2015 07:35:57 +0000 (13:05 +0530)]
powerpc/mm: Add trace point for tracking hash pte fault
This enables us to understand how many hash fault we are taking
when running benchmarks.
For ex:
-bash-4.2# ./perf stat -e powerpc:hash_fault -e page-faults /tmp/ebizzy.ppc64 -S 30 -P -n 1000
...
Performance counter stats for '/tmp/ebizzy.ppc64 -S 30 -P -n 1000':
1,10,04,075 powerpc:hash_fault
1,10,03,429 page-faults
30.
865978991 seconds time elapsed
NOTE:
The impact of the tracepoint was not noticeable when running test. It was
within the run-time variance of the test. For ex:
without-patch:
--------------
Performance counter stats for './a.out 3000 300':
643 page-faults # 0.089 M/sec
7.236562 task-clock (msec) # 0.928 CPUs utilized
2,179,213 stalled-cycles-frontend # 0.00% frontend cycles idle
17,174,367 stalled-cycles-backend # 0.00% backend cycles idle
0 context-switches # 0.000 K/sec
0.
007794658 seconds time elapsed
And with-patch:
---------------
Performance counter stats for './a.out 3000 300':
643 page-faults # 0.089 M/sec
7.233746 task-clock (msec) # 0.921 CPUs utilized
0 context-switches # 0.000 K/sec
0.
007854876 seconds time elapsed
Performance counter stats for './a.out 3000 300':
643 page-faults # 0.087 M/sec
649 powerpc:hash_fault # 0.087 M/sec
7.430376 task-clock (msec) # 0.938 CPUs utilized
2,347,174 stalled-cycles-frontend # 0.00% frontend cycles idle
17,524,282 stalled-cycles-backend # 0.00% backend cycles idle
0 context-switches # 0.000 K/sec
0.
007920284 seconds time elapsed
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:13 +0000 (12:13 +0530)]
selftests/powerpc: Add gitignore file for the new DSCR tests
This patch adds .gitignore for all the newly added DSCR tests.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Sun, 7 Jun 2015 09:25:16 +0000 (19:25 +1000)]
selftests/powerpc: Add thread based stress test for DSCR sysfs interfaces
This patch adds a test to update the system wide DSCR value repeatedly
and then verifies that any thread on any given CPU on the system must be
able to see the same DSCR value whether its is being read through the
problem state based SPR or the privilege state based SPR.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:11 +0000 (12:13 +0530)]
selftests/powerpc: Add test for all DSCR sysfs interfaces
This test continuously updates the system wide DSCR default value in the
sysfs interface and makes sure that the same is reflected across all the
sysfs interfaces for each individual CPUs present on the system.
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:10 +0000 (12:13 +0530)]
selftests/powerpc: Add test for DSCR inheritence across fork & exec
This patch adds a test case to verify that the changed DSCR value inside
any process would be inherited to it's child across the fork and exec
system call.
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:09 +0000 (12:13 +0530)]
selftests/powerpc: Add test for DSCR value inheritence across fork
This patch adds a test to verify that the changed DSCR value inside any
process would be inherited to it's child process across the fork system
call.
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:08 +0000 (12:13 +0530)]
selftests/powerpc: Add test for DSCR SPR numbers
This patch adds a test which verifies that the DSCR privilege and
problem state SPR read & write accesses while making sure that the
results are always the same irrespective of which SPR number is being
used.
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:07 +0000 (12:13 +0530)]
selftests/powerpc: Add test for explicitly changing DSCR value
This patch adds a test which modifies the DSCR using mtspr instruction
and verifies the change using mfspr instruction. It uses both the
privilege state SPR as well as the problem state SPR for the purpose.
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:06 +0000 (12:13 +0530)]
selftests/powerpc: Add test for system wide DSCR default
This patch adds a test case for the system wide DSCR default value,
which when changed through it's sysfs interface must be visible to all
threads reading DSCR either through the privilege state SPR or the
problem state SPR. The DSCR value change should be immediate as well.
Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:05 +0000 (12:13 +0530)]
powerpc/dscr: Add documentation for DSCR support
This patch adds a new documentation file explaining the DSCR support on
powerpc platforms. This explains DSCR related data structure, code paths
and also available user interfaces. Any further functional changes to
the DSCR support in the kernel should definitely update the
documentation here.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:04 +0000 (12:13 +0530)]
powerpc/dscr: Add some in-code documentation
This patch adds some in-code documentation to the DSCR related code to
make it more readable without having any functional change to it.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:03 +0000 (12:13 +0530)]
powerpc/kernel: Rename PACA_DSCR to PACA_DSCR_DEFAULT
PACA_DSCR offset macro tracks dscr_default element in the paca
structure. Better change the name of this macro to match that of the
data element it tracks. Makes the code more readable.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:02 +0000 (12:13 +0530)]
powerpc/kernel: Remove the unused extern dscr_default
The process context switch code no longer uses dscr_default variable
from the sysfs.c file. The variable became unused when we started
storing the CPU specific DSCR value in the PACA structure instead.
This patch just removes this extern declaration. It was originally
added by the following commit.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anshuman Khandual [Thu, 21 May 2015 06:43:01 +0000 (12:13 +0530)]
powerpc: Fix handling of DSCR related facility unavailable exception
Currently DSCR (Data Stream Control Register) can be accessed with
mfspr or mtspr instructions inside a thread via two different SPR
numbers. One being the user accessible problem state SPR number 0x03
and the other being the privilege state SPR number 0x11. All access
through the privilege state SPR number get emulated through illegal
instruction exception. Any access through the problem state SPR number
raises one facility unavailable exception which sets the thread based
dscr_inherit bit and enables DSCR facility through FSCR register thus
allowing direct access to DSCR without going through this exception in
the future. We set the thread.dscr_inherit bit whether the access was
with mfspr or mtspr instruction which is neither correct nor does it
match the behaviour through the instruction emulation code path driven
from privilege state SPR number. User currently observes two different
kind of behaviour when accessing the DSCR through these two SPR numbers.
This problem can be observed through these two test cases by replacing
the privilege state SPR number with the problem state SPR number.
(1) http://ozlabs.org/~anton/junkcode/dscr_default_test.c
(2) http://ozlabs.org/~anton/junkcode/dscr_explicit_test.c
This patch fixes the problem by making sure that the behaviour visible
to the user remains the same irrespective of which SPR number is being
used. Inside facility unavailable exception, we check whether it was
cuased by a mfspr or a mtspr isntrucction. In case of mfspr instruction,
just emulate the instruction. In case of mtspr instruction, set the
thread based dscr_inherit bit and also enable the facility through FSCR.
All user SPR based mfspr instruction will be emulated till one user SPR
based mtspr has been executed.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Fri, 5 Jun 2015 04:38:26 +0000 (14:38 +1000)]
cxl: Reset default context for vPHB on release
When we release the device, we should also invalidate the default context.
With this cxl_get_context() will return null after removal.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
David Gibson [Wed, 3 Jun 2015 04:52:59 +0000 (14:52 +1000)]
powerpc/eeh: Fix trivial error in eeh_restore_dev_state()
Commit
28158cd "powerpc/eeh: Enhance pcibios_set_pcie_reset_state()"
introduced a fix for a problem where certain configurations could lead to
pci_reset_function() destroying the state of PCI devices other than the one
specified.
Unfortunately, the fix has a trivial bug - it calls pci_save_state() again,
when it should be calling pci_restore_state(). This corrects the problem.
Cc: Gavin Shan <gwshan@au1.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jeremy Kerr [Thu, 4 Jun 2015 13:51:47 +0000 (21:51 +0800)]
powerpc/powernv: Add opal-prd channel
This change adds a char device to access the "PRD" (processor runtime
diagnostics) channel to OPAL firmware.
Includes contributions from Vaidyanathan Srinivasan, Neelesh Gupta &
Vishal Kulkarni.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jeremy Kerr [Wed, 20 May 2015 03:23:33 +0000 (11:23 +0800)]
powerpc/powernv: Expose OPAL APIs required by PRD interface
The (upcoming) opal-prd driver needs to access the message notifier and
xscom code, so add EXPORT_SYMBOL_GPL macros for these.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jeremy Kerr [Wed, 20 May 2015 03:23:33 +0000 (11:23 +0800)]
powerpc/powernv: Merge common platform device initialisation
opal_ipmi_init and opal_flash_init are equivalent, except for the
compatbile string. Merge these two into a common opal_pdev_init
function.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Wed, 13 May 2015 05:13:18 +0000 (15:13 +1000)]
powerpc/config: Enable bnx2x on ppc64 and pseries defconfigs
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cédric Le Goater [Wed, 8 Apr 2015 07:31:10 +0000 (09:31 +0200)]
powerpc/powernv: convert OPAL codes returned by sysparam calls
The opal_{get,set}_param calls return internal error codes which need
to be translated in errnos in Linux.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:18 +0000 (16:07 +1000)]
cxl: Add AFU virtual PHB and kernel API
This patch does two things.
Firstly it presents the Accelerator Function Unit (AFUs) behind the POWER
Service Layer (PSL) as PCI devices on a virtual PCI Host Bridge (vPHB). This
in in addition to the PSL being a PCI device itself.
As part of the Coherent Accelerator Interface Architecture (CAIA) AFUs can
provide an AFU configuration. This AFU configuration recored is architected to
be the same as a PCI config space.
This patch sets discovers the AFU configuration records, provides AFU config
space read/write functions to these configuration records. It then enumerates
the PCI bus. It also hooks in PCI ops where appropriate. It also destroys the
vPHB when the physical card is removed.
Secondly, it add an in kernel API for AFU to use CXL. AFUs must present a
driver that firstly binds as a PCI device. This PCI device can then be using
to do CXL specific operations (that can't sit in the PCI ops) using this API.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:17 +0000 (16:07 +1000)]
cxl: Export file ops for use by API
The cxl kernel API will allow drivers other than cxl to export a file
descriptor which has the same userspace API. These file descriptors will be
able to be used against libcxl.
This exports those file ops for use by other drivers.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:16 +0000 (16:07 +1000)]
cxl: Move include file cxl.h -> cxl-base.h
This moves the current include file from cxl.h -> cxl-base.h. This current
include file is used only to pass information between the base driver that
needs to be built into the kernel and the cxl module.
This is to make way for a new include/misc/cxl.h which will
contain just the kernel API for other driver to use
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:15 +0000 (16:07 +1000)]
cxl: Cleanup Makefile
Cleanup Makefile by fixing line wrapping.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:14 +0000 (16:07 +1000)]
cxl: Rework context lifetimes
This reworks contexts lifetimes a bit to enable the kernel API where we may
want to reuse contexts. Here we will want to start and stop contexts without
freeing them.
Start context does the get pid & ctx so stop context will need to do the puts.
Here we move put pid & ctx to the detach context path which will become part of
the stop context path.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:13 +0000 (16:07 +1000)]
cxl: Configure PSL for kernel contexts and merge code
This updates AFU directed and dedicated modes for contexts attached to the
kernel.
The SR (similar to the MSR in the core) calculation is getting
quite complex and is duplicated in AFU directed and dedicated
modes. This patch also merges this SR calculation for these modes.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:12 +0000 (16:07 +1000)]
cxl: Split afu_register_irqs() function
Split the afu_register_irqs() function so that different parts can
be useful elsewhere.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:11 +0000 (16:07 +1000)]
cxl: Only check pid for userspace contexts
We only need to check the pid attached to this context for userspace contexts.
Kernel contexts can skip this check.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:10 +0000 (16:07 +1000)]
cxl: Export some symbols
Export some symbols which will soon be used elsewhere in this driver.
Now they are global we rename them so to avoid collisions.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:09 +0000 (16:07 +1000)]
cxl: cxl_afu_reset() -> __cxl_afu_reset()
Rename cxl_afu_reset() to __cxl_afu_reset() to we can reuse this function name
in the API.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:08 +0000 (16:07 +1000)]
cxl: Rework detach context functions
Rework __detach_context() and cxl_context_detach() so we can reuse them in the
kernel API.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:07 +0000 (16:07 +1000)]
cxl: Add cookie parameter to afu_release_irqs()
Add cookie parameter to afu_release_irqs() so that we can pass in a different
cookie than the context structure. This will be useful for other kernel
drivers that want to call this but get their own cookie back in the interrupt
handler.
Update all existing call sites.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:06 +0000 (16:07 +1000)]
cxl: Dump debug info on the AFU configuration record
Now that we parse the AFU Configuration record, dump some info on it when in
debug mode.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:05 +0000 (16:07 +1000)]
cxl: Fix error path on probe
When probing we call pci_enable_device() but don't call pci_disable_device() on
fail. This causes refcounting issues in the PCI subsystem if a second driver
tries to bind to the same device.
This patch adds the pci_disable_device() to the probe error path. This error
path is hit when this cxl driver tries to bind to AFUs (on the vPHB) rather
than the physical device.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Wed, 27 May 2015 06:07:04 +0000 (16:07 +1000)]
cxl: Re-order card init to check the VSEC earlier
When we expose AFUs as virtual PCI devices, they may look like the physical
CAPI PCI card. ie they may have the same vendor/device IDs.
We want to avoid these AFUs binding to this driver and any init this driver may
do.
Re-order card init to check the VSEC earlier before assigning BARs or
activating CXL. Also change the dev used in early prints as the adapter struct
may not be inited at this earlier stage.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:03 +0000 (16:07 +1000)]
cxl: Remove unnecessarily verbose print in cxl_remove()
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:02 +0000 (16:07 +1000)]
cxl: Add shutdown hook
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:01 +0000 (16:07 +1000)]
cxl: Document external user of existing API
Now that libcxl is public, let's document it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:07:00 +0000 (16:07 +1000)]
powerpc/pci: Add pcibios_disable_device() hook
This adds a hook into the powerpc pci code for pci_disable_device() calls. The
generic code already provides a weak pcibios_disable_device() symbol, so we
just need to provide our own in powerpc and it'll get picked up.
This is passed directly to the phb controller ops, provided one exists.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:06:59 +0000 (16:06 +1000)]
powerpc/pci: Add shutdown hook to pci_controller_ops
Currently pnv_pci_shutdown() calls the PHB shutdown code for all PHBs in the
system. It dereferences the private_data assuming it's a powernv PHB, which
won't be the case when we have different PHB in the systems (like when we add
vPHBs for CXL).
This moves the shutdown hook to the pci_controller_ops and fixes the call site
to use that instead.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:06:58 +0000 (16:06 +1000)]
powerpc: Add cxl context to device archdata
Add cxl context pointer to archdata. We'll want to create one of these for cxl
PCI devices. Put them here until we can get a pci_dev specific private data.
This location was suggested by benh.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:06:57 +0000 (16:06 +1000)]
powerpc/pci: Add release_device() hook to phb ops
Add release_device() hook to phb ops so we can clean up for specific phbs.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Wed, 27 May 2015 06:06:56 +0000 (16:06 +1000)]
powerpc/pci: Export symbols for CXL
Export pcibios_claim_one_bus, pcibios_scan_phb and pcibios_alloc_controller.
These will be used by the CXL driver.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Wed, 27 May 2015 06:06:55 +0000 (16:06 +1000)]
powerpc/copro: Fix faulting kernel segments
This fixes calculating the key bits (KP and KS) in the SLB VSID for kernel
mappings.
I'm not CCing this to stable as there are no uses of this currently.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ian Munsie [Fri, 8 May 2015 12:55:18 +0000 (22:55 +1000)]
cxl: Use call_rcu to reduce latency when releasing the afu fd
The afu fd release path was identified as a significant bottleneck in
the overall performance of cxl. While an optimal AFU design would
minimise the need to close & reopen the AFU fd, it is not always
practical to avoid.
The bottleneck seems to be down to the call to synchronize_rcu(), which
will block until every other thread is guaranteed to be out of an RCU
critical section. Replace it with call_rcu() to free the context
structures later so we can return to the application sooner.
This reduces the time spent in the fd release path from 13356 usec to
13.3 usec - about a 100x speed up.
Reported-by: Fei K Chen <uchen@cn.ibm.com>
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Vaibhav Jain [Fri, 22 May 2015 05:26:05 +0000 (10:56 +0530)]
cxl: Export AFU error buffer via sysfs
Export the "AFU Error Buffer" via sysfs attribute (afu_err_buf). AFU
error buffer is used by the AFU to report application specific
errors. The contents of this buffer are AFU specific and are intended to
be interpreted by the application interacting with the afu.
Suggested-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Vaibhav Jain [Wed, 29 Apr 2015 10:17:23 +0000 (15:47 +0530)]
cxl: Implement an ioctl to fetch afu card-id, offset-id and mode
Given a file descriptor on an afu device, libcxl currently uses the
major/minor number obtained from fstat on the fd to construct path to
the afu's sysfs directory. However it is possible that rather than using
one of the device in /dev/cxl, a kernel driver creates its own device
which export generic cxl interface to the userspace. This causes
problems with libcxl as it tries to use a wrong major/minor number to
construct the sysfs path and fail.
So this patch introduces a new ioctl called CXL_IOCTL_GET_AFU_ID on the
afu file descriptor to fetch the cxl_afu_id struct that holds the
card/offset-id and mode information. These info is then used by libcxl to
construct the correct path to the afu sysfs directory.
Testing:
- Build against pseries be/le configs
- Testing with corresponding libcxl changes to verify that it constructs
right sysfs path to the afu.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 1 Jun 2015 09:44:17 +0000 (19:44 +1000)]
selftests/powerpc: Add install support to more powerpc tests
These tests were merged in parallel to the install support, update them
now to use it.
This also adds cross compile support for the VPHN test which was missing
it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cyril Bur [Tue, 26 May 2015 01:36:57 +0000 (11:36 +1000)]
powerpc/configs: Replace pseries_le_defconfig with a Makefile target using merge_config
Rather than continuing to maintain a copy of pseries_defconfig with
CONFIG_CPU_LITTLE_ENDIAN enabled, use the generic merge_config script
and use an le.config to enable little endian on top of pseries_defconfig
without the need for a duplicated _defconfig file.
This method will require less maintenance in the future and will ensure
that both 'defconfigs' are always in sync.
It is worth noting that the seemingly more simple approach of:
pseries_le_defconfig: pseries_defconfig
$(Q)$(MAKE) le.config
Will not work when building using O=builddir.
The obvious fix to that:
pseries_le_defconfig:
$(Q)$(MAKE) -f $(srctree)/Makefile pseries_defconfig le.config
Also does not work. This is because if we have for example:
config FOO
depends on CPU_BIG_ENDIAN
select BAR
Then BAR will be enabled by the first call to kconfig (via
pseries_defconfig), and then will remain enabled after we merge
le.config, even though FOO will have been turned off.
The solution is to ensure to only invoke the kconfig logic once, after
we have merged all the config fragments. This ensures nothing is
select'ed on that should then be disabled by the later merged configs.
This is done through the explicit call to make olddefconfig
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Reviewed-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com>
[mpe: Massage change log, fix white space and use ARCH not SRCARCH]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cyril Bur [Tue, 26 May 2015 01:36:56 +0000 (11:36 +1000)]
powerpc/configs: Merge pseries_defconfig and pseries_le_defconfig
These two configs should be identical with the exception of big or little
endian.
The big endian version has XMON_DEFAULT turned on while the little endian
has XMON_DEFAULT not set. It makes the most sense for defconfigs not to use
xmon by default, production systems should get back up as quickly as
possible, not sit in xmon.
In the event debugging is required, the option can be enabled or xmon=on
can be specified on commandline.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jiang Liu [Wed, 20 May 2015 09:59:52 +0000 (17:59 +0800)]
powerpc: Use irq_desc_get_xxx() to avoid redundant lookup of irq_desc
Use irq_desc_get_xxx() to avoid redundant lookup of irq_desc while we
already have a pointer to corresponding irq_desc.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Tue, 26 May 2015 05:46:55 +0000 (15:46 +1000)]
powerpc: Non relocatable system call doesn't need a trampoline
We need to use a trampoline when using LOAD_HANDLER(), because the
destination needs to be in the first 64kB. An absolute branch has
no such limitations, so just jump there.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Tue, 26 May 2015 05:46:54 +0000 (15:46 +1000)]
powerpc: Relocatable system call no longer uses the LR
We had some code to restore the LR in the relocatable system call path
back when we used the LR to do an indirect branch.
Commit
6a404806dfce ("powerpc: Avoid link stack corruption in MMU
on syscall entry path") changed this to use the CTR which is volatile
across system calls so does not need restoring.
Remove the stale comment and the restore of the LR.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Tue, 26 May 2015 05:10:24 +0000 (15:10 +1000)]
powerpc/perf: Fix book3s kernel to userspace backtraces
When we take a PMU exception or a software event we call
perf_read_regs(). This overloads regs->result with a boolean that
describes if we should use the sampled instruction address register
(SIAR) or the regs.
If the exception is in kernel, we start with the kernel regs and
backtrace through the kernel stack. At this point we switch to the
userspace regs and backtrace the user stack with perf_callchain_user().
Unfortunately these regs have not got the perf_read_regs() treatment,
so regs->result could be anything. If it is non zero,
perf_instruction_pointer() decides to use the SIAR, and we get issues
like this:
0.11% qemu-system-ppc [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|
---_raw_spin_lock_irqsave
|
|--52.35%-- 0
| |
| |--46.39%-- __hrtimer_start_range_ns
| | kvmppc_run_core
| | kvmppc_vcpu_run_hv
| | kvmppc_vcpu_run
| | kvm_arch_vcpu_ioctl_run
| | kvm_vcpu_ioctl
| | do_vfs_ioctl
| | sys_ioctl
| | system_call
| | |
| | |--67.08%-- _raw_spin_lock_irqsave <--- hi mum
| | | |
| | | --100.00%-- 0x7e714
| | | 0x7e714
Notice the bogus _raw_spin_irqsave when we transition from kernel
(system_call) to userspace (0x7e714). We inserted what was in the SIAR.
Add a check in regs_use_siar() to check that the regs in question
are from a PMU exception. With this fix the backtrace makes sense:
0.47% qemu-system-ppc [kernel.vmlinux] [k] _raw_spin_lock_irqsave
|
---_raw_spin_lock_irqsave
|
|--53.83%-- 0
| |
| |--44.73%-- hrtimer_try_to_cancel
| | kvmppc_start_thread
| | kvmppc_run_core
| | kvmppc_vcpu_run_hv
| | kvmppc_vcpu_run
| | kvm_arch_vcpu_ioctl_run
| | kvm_vcpu_ioctl
| | do_vfs_ioctl
| | sys_ioctl
| | system_call
| | __ioctl
| | 0x7e714
| | 0x7e714
Cc: stable@vger.kernel.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 1 Jun 2015 11:11:35 +0000 (21:11 +1000)]
powerpc/mm: Fix build break with STRICT_MM_TYPECHECKS && DEBUG_PAGEALLOC
If both STRICT_MM_TYPECHECKS and DEBUG_PAGEALLOC are enabled, the code
in kernel_map_linear_page() is built, and so we fail with:
arch/powerpc/mm/hash_utils_64.c:1478:2:
error: incompatible type for argument 1 of 'htab_convert_pte_flags'
Fix it by using pgprot_val().
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 28 Apr 2015 05:12:07 +0000 (15:12 +1000)]
powerpc/powernv: Move dma_set_mask() from pnv_phb to pci_controller_ops
Previously, dma_set_mask() on powernv was convoluted:
0) Call dma_set_mask() (a/p/kernel/dma.c)
1) In dma_set_mask(), ppc_md.dma_set_mask() exists, so call it.
2) On powernv, that function pointer is pnv_dma_set_mask().
In pnv_dma_set_mask(), the device is pci, so call pnv_pci_dma_set_mask().
3) In pnv_pci_dma_set_mask(), call pnv_phb->set_dma_mask() if it exists.
4) It only exists in the ioda case, where it points to
pnv_pci_ioda_dma_set_mask(), which is the final function.
So the call chain is:
dma_set_mask() ->
pnv_dma_set_mask() ->
pnv_pci_dma_set_mask() ->
pnv_pci_ioda_dma_set_mask()
Both ppc_md and pnv_phb function pointers are used.
Rip out the ppc_md call, pnv_dma_set_mask() and pnv_pci_dma_set_mask().
Instead:
0) Call dma_set_mask() (a/p/kernel/dma.c)
1) In dma_set_mask(), the device is pci, and pci_controller_ops.dma_set_mask()
exists, so call pci_controller_ops.dma_set_mask()
2) In the ioda case, that points to pnv_pci_ioda_dma_set_mask().
The new call chain is
dma_set_mask() ->
pnv_pci_ioda_dma_set_mask()
Now only the pci_controller_ops function pointer is used.
The fallback paths for p5ioc2 are the same.
Previously, pnv_pci_dma_set_mask() would find no pnv_phb->set_dma_mask()
function, to it would call __set_dma_mask().
Now, dma_set_mask() finds no ppc_md call or pci_controller_ops call,
so it calls __set_dma_mask().
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 28 Apr 2015 05:12:06 +0000 (15:12 +1000)]
powerpc/pci: add dma_set_mask to pci_controller_ops
Some systems only need to deal with DMA masks for PCI devices.
For these systems, we can avoid the need for a platform hook and
instead use a pci controller based hook.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 28 Apr 2015 05:12:05 +0000 (15:12 +1000)]
powerpc/powernv: Specialise pci_controller_ops for each controller type
Remove powernv generic PCI controller operations. Replace it with
controller ops for each of the two supported PHBs.
As an added bonus, make the two new structs const, which will help
guard against bugs such as the one introduced in
65ebf4b63
("powerpc/powernv: Move controller ops from ppc_md to controller_ops")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:28:03 +0000 (14:28 +1000)]
powerpc: Remove MSI-related PCI controller ops from ppc_md
Remove unneeded ppc_md functions. Patch callsites to use pci_controller_ops
functions exclusively.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:28:02 +0000 (14:28 +1000)]
powerpc/mpic_u3msi: Move MSI-related ops to pci_controller_ops
Move the u3 MPIC msi subsystem to use the pci_controller_ops structure
rather than ppc_md for MSI related PCI controller operations.
As with fsl_msi, operations are plugged in at the subsys level, after
controller creation. Again, we iterate over all controllers and
populate them with the MSI ops.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:28:01 +0000 (14:28 +1000)]
powerpc/pasemi: Move MSI-related ops to pci_controller_ops
Move the PaSemi MPIC msi subsystem to use the pci_controller_ops
structure rather than ppc_md for MSI related PCI controller
operations.
As with fsl_msi, operations are plugged in at the subsys level, after
controller creation. Again, we iterate over all controllers and
populate them with the MSI ops.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:28:00 +0000 (14:28 +1000)]
powerpc/ppc4xx_hsta_msi: Move MSI-related ops to pci_controller_ops
Move the ppc4xx hsta msi subsystem to use the pci_controller_ops
structure rather than ppc_md for MSI related PCI controller
operations.
As with fsl_msi, operations are plugged in at the subsys level, after
controller creation. Again, we iterate over all controllers and
populate them with the MSI ops.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:27:59 +0000 (14:27 +1000)]
powerpc/ppc4xx_msi: Move MSI-related ops to pci_controller_ops
Move the ppc4xx msi subsystem to use the pci_controller_ops structure
rather than ppc_md for MSI related PCI controller operations.
As with fsl_msi, operations are plugged in at the subsys level, after
controller creation. Again, we iterate over all controllers and
populate them with the MSI ops.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:27:58 +0000 (14:27 +1000)]
powerpc/fsl_msi: Move MSI-related ops to pci_controller_ops
Move the fsl_msi subsystem to use the pci_controller_ops structure
rather than ppc_md for MSI related PCI controller operations.
Previously, MSI ops were added to ppc_md at the subsys level. However,
in fsl_pci.c, PCI controllers are created at the at arch level. So,
unlike in e.g. PowerNV/pSeries/Cell, we can't simply populate a
platform-level controller ops structure and have it copied into the
controllers when they are created.
Instead, walk every phb, and attempt to populate it with the MSI ops.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Daniel Axtens [Tue, 14 Apr 2015 04:27:57 +0000 (14:27 +1000)]
powerpc/pseries: Move MSI-related ops to pci_controller_ops
Move the pseries platform to use the pci_controller_ops structure
rather than ppc_md for MSI related PCI controller operations
We need to iterate all PHBs because the MSI setup happens later than
find_and_init_phbs() - mpe.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>