Paolo Bonzini [Sat, 29 Mar 2014 14:44:05 +0000 (15:44 +0100)]
Merge branch 'kvm-ppchv-next' of git://git./linux/kernel/git/paulus/powerpc into kvm-next
Paul Mackerras [Mon, 24 Mar 2014 23:47:08 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
Currently we save the host PMU configuration, counter values, etc.,
when entering a guest, and restore it on return from the guest.
(We have to do this because the guest has control of the PMU while
it is executing.) However, we missed saving/restoring the SIAR and
SDAR registers, as well as the registers which are new on POWER8,
namely SIER and MMCR2.
This adds code to save the values of these registers when entering
the guest and restore them on exit. This also works around the bug
in POWER8 where setting PMAE with a counter already negative doesn't
generate an interrupt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Paul Mackerras [Mon, 24 Mar 2014 23:47:07 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
Commit
c7699822bc21 ("KVM: PPC: Book3S HV: Make physical thread 0 do
the MMU switching") reordered the guest entry/exit code so that most
of the guest register save/restore code happened in guest MMU context.
A side effect of that is that the timebase still contains the guest
timebase value at the point where we compute and use vcpu->arch.dec_expires,
and therefore that is now a guest timebase value rather than a host
timebase value. That in turn means that the timeouts computed in
kvmppc_set_timer() are wrong if the timebase offset for the guest is
non-zero. The consequence of that is things such as "sleep 1" in a
guest after migration may sleep for much longer than they should.
This fixes the problem by converting between guest and host timebase
values as necessary, by adding or subtracting the timebase offset.
This also fixes an incorrect comment.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Paul Mackerras [Mon, 24 Mar 2014 23:47:06 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running. However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.
To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode. kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
kvm_dereference_check().
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Paul Mackerras [Mon, 24 Mar 2014 23:47:05 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
If an attempt is made to load the kvm-hv module on a machine which
doesn't have hypervisor mode available, return an ENODEV error,
which is the conventional thing to return to indicate that this
module is not applicable to the hardware of the current machine,
rather than EIO, which causes a warning to be printed.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Paul Mackerras [Mon, 24 Mar 2014 23:47:04 +0000 (10:47 +1100)]
KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
The in-kernel emulation of RTAS functions needs to read the argument
buffer from guest memory in order to find out what function is being
requested. The guest supplies the guest physical address of the buffer,
and on a real system the code that reads that buffer would run in guest
real mode. In guest real mode, the processor ignores the top 4 bits
of the address specified in load and store instructions. In order to
emulate that behaviour correctly, we need to mask off those bits
before calling kvm_read_guest() or kvm_write_guest(). This adds that
masking.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Michael Neuling [Mon, 24 Mar 2014 23:47:03 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
This adds code to get/set_one_reg to read and write the new transactional
memory (TM) state.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Michael Neuling [Mon, 24 Mar 2014 23:47:02 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Add transactional memory support
This adds saving of the transactional memory (TM) checkpointed state
on guest entry and exit. We only do this if we see that the guest has
an active transaction.
It also adds emulation of the TM state changes when delivering IRQs
into the guest. According to the architecture, if we are
transactional when an IRQ occurs, the TM state is changed to
suspended, otherwise it's left unchanged.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Christoffer Dall [Tue, 28 Jan 2014 16:28:42 +0000 (08:28 -0800)]
KVM: Specify byte order for KVM_EXIT_MMIO
The KVM API documentation is not clear about the semantics of the data
field on the mmio struct on the kvm_run struct.
This has become problematic when supporting ARM guests on big-endian
host systems with guests of both endianness types, because it is unclear
how the data should be exported to user space.
This should not break with existing implementations as all supported
existing implementations of known user space applications (QEMU and
kvmtools for virtio) only support default endianness of the
architectures on the host side.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Alexander Graf <agraf@suse.de>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 26 Mar 2014 14:54:00 +0000 (15:54 +0100)]
KVM: vmx: fix MPX detection
kvm_x86_ops is still NULL at this point. Since kvm_init_msr_list
cannot fail, it is safe to initialize it before the call.
Fixes:
93c4adc7afedf9b0ec190066d45b6d67db5270da
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Anton Blanchard [Mon, 24 Mar 2014 23:47:01 +0000 (10:47 +1100)]
KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
I noticed KVM is broken when KVM in-kernel XICS emulation
(CONFIG_KVM_XICS) is disabled.
The problem was introduced in
48eaef05 (KVM: PPC: Book3S HV: use
xics_wake_cpu only when defined). It used CONFIG_KVM_XICS to wrap
xics_wake_cpu, where CONFIG_PPC_ICP_NATIVE should have been
used.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Laurent Dufour [Fri, 21 Feb 2014 15:31:10 +0000 (16:31 +0100)]
KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
This introduces the H_GET_TCE hypervisor call, which is basically the
reverse of H_PUT_TCE, as defined in the Power Architecture Platform
Requirements (PAPR).
The hcall H_GET_TCE is required by the kdump kernel, which uses it to
retrieve TCEs set up by the previous (panicked) kernel.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Greg Kurz [Thu, 6 Feb 2014 16:36:56 +0000 (17:36 +0100)]
KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
When the guest does an MMIO write which is handled successfully by an
ioeventfd, ioeventfd_write() returns 0 (success) and
kvmppc_handle_store() returns EMULATE_DONE. Then
kvmppc_emulate_mmio() converts EMULATE_DONE to RESUME_GUEST_NV and
this causes an exit from the loop in kvmppc_vcpu_run_hv(), causing an
exit back to userspace with a bogus exit reason code, typically
causing userspace (e.g. qemu) to crash with a message about an unknown
exit code.
This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv() in order
to fix that. For generality, we define a helper to check for either
of the return-to-guest codes we use, RESUME_GUEST and RESUME_GUEST_NV,
to make it easy to check for either and provide one place to update if
any other return-to-guest code gets defined in future.
Since it only affects Book3S HV for now, the helper is added to
the kvm_book3s.h header file.
We use the helper in two places in kvmppc_run_core() as well for
future-proofing, though we don't see RESUME_GUEST_NV in either place
at present.
[paulus@samba.org - combined 4 patches into one, rewrote description]
Suggested-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Paolo Bonzini [Tue, 25 Mar 2014 14:44:06 +0000 (15:44 +0100)]
Merge tag 'kvm-s390-
20140325' of git://git./linux/kernel/git/kvms390/linux into kvm-next
3 fixes
- memory leak on certain SIGP conditions
- wrong size for idle bitmap (always too big)
- clear local interrupts on initial CPU reset
1 performance improvement
- improve performance with many guests on certain workloads
Jens Freimann [Tue, 11 Feb 2014 12:48:07 +0000 (13:48 +0100)]
KVM: s390: clear local interrupts at cpu initial reset
Empty list of local interrupts when vcpu goes through initial reset
to provide a clean state
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Thu, 20 Mar 2014 12:20:46 +0000 (13:20 +0100)]
KVM: s390: Fix possible memory leak in SIGP functions
When kvm_get_vcpu() returned NULL for the destination CPU in
__sigp_emergency() or __sigp_external_call(), the memory for the
"inti" structure was not released anymore. This patch fixes this
issue by moving the check for !dst_vcpu before the kzalloc() call.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Jens Freimann [Tue, 18 Mar 2014 15:34:18 +0000 (16:34 +0100)]
KVM: s390: fix calculation of idle_mask array size
We need BITS_TO_LONGS, not sizeof(long) to calculate
the correct size.
idle_mask is a bitmask, each bit representing the state
of a cpu. The desired outcome is an array of unsigned long
fields that can fit KVM_MAX_VCPUS bits. We should not use
sizeof(long) which returnes the size in bytes, but BITS_TO_LONGS
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Wed, 19 Mar 2014 10:18:29 +0000 (11:18 +0100)]
KVM: s390: randomize sca address
We allocate a page for the 2k sca, so lets use the space to improve
hit rate of some internal cpu caches. No need to change the freeing
of the page, as this will shift away the page offset bits anyway.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Paolo Bonzini [Mon, 24 Mar 2014 10:49:13 +0000 (11:49 +0100)]
Merge branch 'kvms390-irqfd' of git://git./linux/kernel/git/kvms390/linux into kvm-next
Paolo Bonzini [Tue, 18 Mar 2014 10:39:23 +0000 (11:39 +0100)]
KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
After the previous patches, an interrupt whose bit is set in the IRR
register will never be in the LAPIC's IRR and has never been injected
on the migration source. So inject it on the destination.
This fixes migration of Windows guests without HPET (they use the RTC
to trigger the scheduler tick, and lose it after migration).
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cornelia Huck [Tue, 25 Feb 2014 11:48:01 +0000 (12:48 +0100)]
KVM: Bump KVM_MAX_IRQ_ROUTES for s390
The maximum number for irq routes is currently 1024, which is a bit on
the small size for s390: We support up to 4 x 64k virtual devices with
up to 64 queues, and we need one route for each of the queues if we want
to operate it via irqfd.
Let's bump this to 4k on s390 for now, as this at least covers the saner
setups.
We need to find a more general solution, though, as we can't just grow
the routing table indefinitly.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cornelia Huck [Mon, 15 Jul 2013 11:36:01 +0000 (13:36 +0200)]
KVM: s390: irq routing for adapter interrupts.
Introduce a new interrupt class for s390 adapter interrupts and enable
irqfds for s390.
This is depending on a new s390 specific vm capability, KVM_CAP_S390_IRQCHIP,
that needs to be enabled by userspace.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cornelia Huck [Mon, 15 Jul 2013 11:36:01 +0000 (13:36 +0200)]
KVM: s390: adapter interrupt sources
Add a new interface to register/deregister sources of adapter interrupts
identified by an unique id via the flic. Adapters may also be maskable
and carry a list of pinned pages.
These adapters will be used by irq routing later.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cornelia Huck [Wed, 23 Oct 2013 16:26:34 +0000 (18:26 +0200)]
KVM: Add per-vm capability enablement.
Allow KVM_ENABLE_CAP to act on a vm as well as on a vcpu. This makes more
sense when the caller wants to enable a vm-related capability.
s390 will be the first user; wire it up.
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Paolo Bonzini [Tue, 18 Mar 2014 11:00:14 +0000 (12:00 +0100)]
KVM: ioapic: extract body of kvm_ioapic_set_irq
We will reuse it to process a nonzero IRR that is passed to KVM_SET_IRQCHIP.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 18 Mar 2014 09:47:17 +0000 (10:47 +0100)]
KVM: ioapic: clear IRR for edge-triggered interrupts at delivery
This ensures that IRR bits are set in the KVM_GET_IRQCHIP result only if
the interrupt is still sitting in the IOAPIC. After the next patches, it
avoids spurious reinjection of the interrupt when KVM_SET_IRQCHIP is
called.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 18 Mar 2014 10:51:29 +0000 (11:51 +0100)]
KVM: ioapic: merge ioapic_deliver into ioapic_service
Commonize the handling of masking, which was absent for kvm_ioapic_set_irq.
Setting remote_irr does not need a separate function either, and merging
the two functions avoids confusion.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Hogan [Fri, 14 Mar 2014 13:06:10 +0000 (13:06 +0000)]
MIPS: KVM: Remove dead code in CP0 emulation
The code to check whether rd > MIPS_CP0_DESAVE is dead code, since
MIPS_CP0_DESAVE = 31 and rd is already masked with 0x1f. Remove it.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Hogan [Fri, 14 Mar 2014 13:06:09 +0000 (13:06 +0000)]
MIPS: KVM: Consult HWREna before emulating RDHWR
The ability to read hardware registers from userland with the RDHWR
instruction should depend upon the corresponding bit of the HWREna
register being set, otherwise a reserved instruction exception should be
generated.
However KVM's current emulation ignores the guest's HWREna and always
emulates RDHWR instructions even if the guest OS has disallowed them.
Therefore rework the RDHWR emulation code to check for privilege or the
corresponding bit in the guest HWREna bit. Also remove the #if 0 case
for the UserLocal register. I presume it was there for debug purposes
but it seems unnecessary now that the guest can control whether it
causes a guest exception.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Hogan [Fri, 14 Mar 2014 13:06:07 +0000 (13:06 +0000)]
MIPS: KVM: Pass reserved instruction exceptions to guest
Previously a reserved instruction exception while in guest code would
cause a KVM internal error if kvm_mips_handle_ri() didn't recognise the
instruction (including a RDHWR from an unrecognised hardware register).
However the guest OS should really have the opportunity to catch the
exception so that it can take the appropriate actions such as sending a
SIGILL to the guest user process or emulating the instruction itself.
Therefore in these cases emulate a guest RI exception and only return
EMULATE_FAIL if that fails, being careful to revert the PC first in case
the exception occurred in a branch delay slot in which case the PC will
already point to the branch target.
Also turn the printk messages relating to these cases into kvm_debug
messages so that they aren't usually visible.
This allows crashme to run in the guest without killing the entire VM.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Hogan [Fri, 14 Mar 2014 13:06:08 +0000 (13:06 +0000)]
MIPS: KVM: asm/kvm_host.h: Clean up whitespace
The whitespace in asm/kvm_host.h is quite inconsistent in places. Clean
up the whole file to use tabs more consistently.
When you use the --ignore-space-change argument to git diff this patch
only changes line wrapping in TLB_IS_GLOBAL and TLB_IS_VALID macros.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cornelia Huck [Mon, 17 Mar 2014 18:11:35 +0000 (19:11 +0100)]
KVM: eventfd: Fix lock order inversion.
When registering a new irqfd, we call its ->poll method to collect any
event that might have previously been pending so that we can trigger it.
This is done under the kvm->irqfds.lock, which means the eventfd's ctx
lock is taken under it.
However, if we get a POLLHUP in irqfd_wakeup, we will be called with the
ctx lock held before getting the irqfds.lock to deactivate the irqfd,
causing lockdep to complain.
Calling the ->poll method does not really need the irqfds.lock, so let's
just move it after we've given up the irqfds.lock in kvm_irqfd_assign().
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 5 Mar 2014 22:19:52 +0000 (23:19 +0100)]
KVM: x86: handle missing MPX in nested virtualization
When doing nested virtualization, we may be able to read BNDCFGS but
still not be allowed to write to GUEST_BNDCFGS in the VMCS. Guard
writes to the field with vmx_mpx_supported(), and similarly hide the
MSR from userspace if the processor does not support the field.
We could work around this with the generic MSR save/load machinery,
but there is only a limited number of MSR save/load slots and it is
not really worthwhile to waste one for a scenario that should not
happen except in the nested virtualization case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 24 Feb 2014 11:30:04 +0000 (12:30 +0100)]
KVM: x86: Add nested virtualization support for MPX
This is simple to do, the "host" BNDCFGS is either 0 or the guest value.
However, both controls have to be present. We cannot provide MPX if
we only have one of the "load BNDCFGS" or "clear BNDCFGS" controls.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 24 Feb 2014 11:15:16 +0000 (12:15 +0100)]
KVM: x86: introduce kvm_supported_xcr0()
XSAVE support for KVM is already using host_xcr0 & KVM_SUPPORTED_XCR0 as
a "dynamic" version of KVM_SUPPORTED_XCR0.
However, this is not enough because the MPX bits should not be presented
to the guest unless kvm_x86_ops confirms the support. So, replace all
instances of host_xcr0 & KVM_SUPPORTED_XCR0 with a new function
kvm_supported_xcr0() that also has this check.
Note that here:
if (xstate_bv & ~KVM_SUPPORTED_XCR0)
return -EINVAL;
if (xstate_bv & ~host_cr0)
return -EINVAL;
the code is equivalent to
if ((xstate_bv & ~KVM_SUPPORTED_XCR0) ||
(xstate_bv & ~host_cr0)
return -EINVAL;
i.e. "xstate_bv & (~KVM_SUPPORTED_XCR0 | ~host_cr0)" which is in turn
equal to "xstate_bv & ~(KVM_SUPPORTED_XCR0 & host_cr0)". So we should
also use the new function there.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 17 Mar 2014 11:21:35 +0000 (12:21 +0100)]
Merge tag 'kvm-s390-
20140317' of git://git./linux/kernel/git/kvms390/linux into HEAD
Two patches:
- one regression fix for reducing the amount of ucontrol userspace exits
- get rid of BUG_ONs in hot inner loops
Igor Mammedov [Sat, 15 Mar 2014 20:02:00 +0000 (21:02 +0100)]
KVM: x86 emulator: emulate MOVAPD
Add emulation for 0x66 prefixed instruction of 0f 28 opcode
that has been added earlier.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Igor Mammedov [Sat, 15 Mar 2014 20:01:59 +0000 (21:01 +0100)]
KVM: x86 emulator: emulate MOVAPS
HCK memory driver test fails when testing 32-bit Windows 8.1
with baloon driver.
tracing KVM shows error:
reason EXIT_ERR rip 0x81c18326 info 0 0
x/10i 0x81c18326-20
0x0000000081c18312: add %al,(%eax)
0x0000000081c18314: add %cl,-0x7127711d(%esi)
0x0000000081c1831a: rolb $0x0,0x80ec(%ecx)
0x0000000081c18321: and $0xfffffff0,%esp
0x0000000081c18324: mov %esp,%esi
0x0000000081c18326: movaps %xmm0,(%esi)
0x0000000081c18329: movaps %xmm1,0x10(%esi)
0x0000000081c1832d: movaps %xmm2,0x20(%esi)
0x0000000081c18331: movaps %xmm3,0x30(%esi)
0x0000000081c18335: movaps %xmm4,0x40(%esi)
which points to MOVAPS instruction currently no emulated by KVM.
Fix it by adding appropriate entries to opcode table in KVM's emulator.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christian Borntraeger [Thu, 6 Mar 2014 15:01:38 +0000 (16:01 +0100)]
KVM: s390: Optimize ucontrol path
Since commit
7c470539c95630c1f2a10f109e96f249730b75eb
(s390/kvm: avoid automatic sie reentry) we will run through the C code
of KVM on host interrupts instead of just reentering the guest. This
will result in additional ucontrol exits (at least HZ per second). Let
handle a 0 intercept in the kernel and dont return to userspace,
even if in ucontrol mode.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
CC: stable@vger.kernel.org
Dominik Dingel [Mon, 10 Mar 2014 14:23:34 +0000 (15:23 +0100)]
KVM: s390: Removing untriggerable BUG_ONs
The BUG_ON in kvm-s390.c is unreachable, as we get the vcpu per common code,
which itself does this from the private_data field of the file descriptor,
and there is no KVM_UNCREATE_VCPU.
The __{set,unset}_cpu_idle BUG_ONs are not triggerable because the vcpu
creation code already checks against KVM_MAX_VCPUS.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Paolo Bonzini [Fri, 14 Mar 2014 15:06:30 +0000 (16:06 +0100)]
Merge branch 'kvm-ppc-fix' into HEAD
Gabriel L. Somlo [Fri, 28 Feb 2014 04:06:17 +0000 (23:06 -0500)]
kvm: x86: ignore ioapic polarity
Both QEMU and KVM have already accumulated a significant number of
optimizations based on the hard-coded assumption that ioapic polarity
will always use the ActiveHigh convention, where the logical and
physical states of level-triggered irq lines always match (i.e.,
active(asserted) == high == 1, inactive == low == 0). QEMU guests
are expected to follow directions given via ACPI and configure the
ioapic with polarity 0 (ActiveHigh). However, even when misbehaving
guests (e.g. OS X <= 10.9) set the ioapic polarity to 1 (ActiveLow),
QEMU will still use the ActiveHigh signaling convention when
interfacing with KVM.
This patch modifies KVM to completely ignore ioapic polarity as set by
the guest OS, enabling misbehaving guests to work alongside those which
comply with the ActiveHigh polarity specified by QEMU's ACPI tables.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
[Move documentation to KVM_IRQ_LINE, add ia64. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paul Mackerras [Thu, 13 Mar 2014 09:02:48 +0000 (20:02 +1100)]
KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVE
Commit
595e4f7e697e ("KVM: PPC: Book3S HV: Use load/store_fp_state
functions in HV guest entry/exit") changed the register usage in
kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the
instructions that load and save VRSAVE. The result is that the
VRSAVE value was loaded from a constant address, and saved to a
location past the end of the vcpu struct, causing host kernel
memory corruption and various kinds of host kernel crashes.
This fixes the problem by using register r31, which contains the
vcpu pointer, instead of r3 and r4.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paul Mackerras [Thu, 13 Mar 2014 09:02:02 +0000 (20:02 +1100)]
KVM: PPC: Book3S HV: Remove bogus duplicate code
Commit
7b490411c37f ("KVM: PPC: Book3S HV: Add new state for
transactional memory") incorrectly added some duplicate code to the
guest exit path because I didn't manage to clean up after a rebase
correctly. This removes the extraneous material. The presence of
this extraneous code causes host crashes whenever a guest is run.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 21 Feb 2014 09:32:27 +0000 (10:32 +0100)]
KVM: svm: Allow the guest to run with dirty debug registers
When not running in guest-debug mode (i.e. the guest controls the debug
registers, having to take an exit for each DR access is a waste of time.
If the guest gets into a state where each context switch causes DR to be
saved and restored, this can take away as much as 40% of the execution
time from the guest.
If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we
can let it write freely to the debug registers and reload them on the
next exit. We still need to exit on the first access, so that the
KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further
accesses to the debug registers will not cause a vmexit.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 3 Mar 2014 12:08:29 +0000 (13:08 +0100)]
KVM: svm: set/clear all DR intercepts in one swoop
Unlike other intercepts, debug register intercepts will be modified
in hot paths if the guest OS is bad or otherwise gets tricked into
doing so.
Avoid calling recalc_intercepts 16 times for debug registers.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 21 Feb 2014 09:36:37 +0000 (10:36 +0100)]
KVM: nVMX: Allow nested guests to run with dirty debug registers
When preparing the VMCS02, the CPU-based execution controls is computed
by vmx_exec_control. Turn off DR access exits there, too, if the
KVM_DEBUGREG_WONT_EXIT bit is set in switch_db_regs.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 21 Feb 2014 09:32:27 +0000 (10:32 +0100)]
KVM: vmx: Allow the guest to run with dirty debug registers
When not running in guest-debug mode (i.e. the guest controls the debug
registers, having to take an exit for each DR access is a waste of time.
If the guest gets into a state where each context switch causes DR to be
saved and restored, this can take away as much as 40% of the execution
time from the guest.
If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we
can let it write freely to the debug registers and reload them on the
next exit. We still need to exit on the first access, so that the
KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further
accesses to the debug registers will not cause a vmexit.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 21 Feb 2014 09:17:24 +0000 (10:17 +0100)]
KVM: x86: Allow the guest to run with dirty debug registers
When not running in guest-debug mode, the guest controls the debug
registers and having to take an exit for each DR access is a waste
of time. If the guest gets into a state where each context switch
causes DR to be saved and restored, this can take away as much as 40%
of the execution time from the guest.
After this patch, VMX- and SVM-specific code can set a flag in
switch_db_regs, telling vcpu_enter_guest that on the next exit the debug
registers might be dirty and need to be reloaded (syncing will be taken
care of by a new callback in kvm_x86_ops). This flag can be set on the
first access to a debug registers, so that multiple accesses to the
debug registers only cause one vmexit.
Note that since the guest will be able to read debug registers and
enable breakpoints in DR7, we need to ensure that they are synchronized
on entry to the guest---including DR6 that was not synced before.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 21 Feb 2014 08:55:56 +0000 (09:55 +0100)]
KVM: x86: change vcpu->arch.switch_db_regs to a bit mask
The next patch will add another bit that we can test with the
same "if".
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 21 Feb 2014 09:55:44 +0000 (10:55 +0100)]
KVM: vmx: we do rely on loading DR7 on entry
Currently, this works even if the bit is not in "min", because the bit is always
set in MSR_IA32_VMX_ENTRY_CTLS. Mention it for the sake of documentation, and
to avoid surprises if we later switch to MSR_IA32_VMX_TRUE_ENTRY_CTLS.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Fri, 7 Mar 2014 19:03:15 +0000 (20:03 +0100)]
KVM: x86: Remove return code from enable_irq/nmi_window
It's no longer possible to enter enable_irq_window in guest mode when
L1 intercepts external interrupts and we are entering L2. This is now
caught in vcpu_enter_guest. So we can remove the check from the VMX
version of enable_irq_window, thus the need to return an error code from
both enable_irq_window and enable_nmi_window.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Fri, 7 Mar 2014 19:03:14 +0000 (20:03 +0100)]
KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
According to SDM 27.2.3, IDT vectoring information will not be valid on
vmexits caused by external NMIs. So we have to avoid creating such
scenarios by delaying EXIT_REASON_EXCEPTION_NMI injection as long as we
have a pending interrupt because that one would be migrated to L1's IDT
vectoring info on nested exit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Fri, 7 Mar 2014 19:03:13 +0000 (20:03 +0100)]
KVM: nVMX: Fully emulate preemption timer
We cannot rely on the hardware-provided preemption timer support because
we are holding L2 in HLT outside non-root mode. Furthermore, emulating
the preemption will resolve tick rate errata on older Intel CPUs.
The emulation is based on hrtimer which is started on L2 entry, stopped
on L2 exit and evaluated via the new check_nested_events hook. As we no
longer rely on hardware features, we can enable both the preemption
timer support and value saving unconditionally.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Fri, 7 Mar 2014 19:03:12 +0000 (20:03 +0100)]
KVM: nVMX: Rework interception of IRQs and NMIs
Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.
The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 6 Mar 2014 11:50:54 +0000 (12:50 +0100)]
Merge tag 'kvm-s390-
20140306' of git://git./linux/kernel/git/kvms390/linux into kvm-next
One fix for virtio-ccw, fixing a problem introduced with
"virtio_ccw: fix vcdev pointer handling issues" and noticed just
after it went into git.
Heinz Graalfs [Wed, 5 Mar 2014 14:23:54 +0000 (15:23 +0100)]
virtio_ccw: fix hang in set offline processing
During set offline processing virtio_grab_drvdata() incorrectly
calls dev_set_drvdata() to remove the virtio_ccw_device from the
parent ccw_device's driver data. This is wrong and ends up in a
hang during virtio_ccw_reset(), as the interrupt handler still
has need of the virtio_ccw_device.
A new field 'going_away' is introduced in struct virtio_ccw_device
to control the usage of the ccw_device's driver data pointer in
virtio_grab_drvdata().
Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Paolo Bonzini [Tue, 4 Mar 2014 14:58:00 +0000 (15:58 +0100)]
Merge tag 'kvm-for-3.15-1' of git://git./linux/kernel/git/maz/arm-platforms into kvm-next
Paolo Bonzini [Tue, 4 Mar 2014 11:32:03 +0000 (12:32 +0100)]
Merge tag 'kvm-s390-
20140304' of git://git./linux/kernel/git/kvms390/linux into kvm-next
Andrew Jones [Fri, 28 Feb 2014 11:52:55 +0000 (12:52 +0100)]
x86: kvm: introduce periodic global clock updates
commit
0061d53daf26f introduced a mechanism to execute a global clock
update for a vm. We can apply this periodically in order to propagate
host NTP corrections. Also, if all vcpus of a vm are pinned, then
without an additional trigger, no guest NTP corrections can propagate
either, as the current trigger is only vcpu cpu migration.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrew Jones [Fri, 28 Feb 2014 11:52:54 +0000 (12:52 +0100)]
x86: kvm: rate-limit global clock updates
When we update a vcpu's local clock it may pick up an NTP correction.
We can't wait an indeterminate amount of time for other vcpus to pick
up that correction, so commit
0061d53daf26f introduced a global clock
update. However, we can't request a global clock update on every vcpu
load either (which is what happens if the tsc is marked as unstable).
The solution is to rate-limit the global clock updates. Marcelo
calculated that we should delay the global clock updates no more
than 0.1s as follows:
Assume an NTP correction c is applied to one vcpu, but not the other,
then in n seconds the delta of the vcpu system_timestamps will be
c * n. If we assume a correction of 500ppm (worst-case), then the two
vcpus will diverge 50us in 0.1s, which is a considerable amount.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cornelia Huck [Wed, 6 Feb 2013 09:23:39 +0000 (10:23 +0100)]
virtio-ccw: virtio-ccw adapter interrupt support.
Implement the new CCW_CMD_SET_IND_ADAPTER command and try to enable
adapter interrupts for every device on the first startup. If the host
does not support adapter interrupts, fall back to normal I/O interrupts.
virtio-ccw adapter interrupts use the same isc as normal I/O subchannels
and share a summary indicator for all devices sharing the same indicator
area.
Indicator bits for the individual virtqueues may be contained in the same
indicator area for different devices.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Martin Schwidefsky [Thu, 13 Feb 2014 12:02:32 +0000 (13:02 +0100)]
s390/airq: add support for irq ranges
Add airq_iv_alloc and airq_iv_free to allocate and free consecutive
ranges of irqs from the interrupt vector.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Heinz Graalfs [Thu, 27 Feb 2014 13:34:35 +0000 (14:34 +0100)]
virtio_ccw: fix vcdev pointer handling issues
The interrupt handler virtio_ccw_int_handler() using the vcdev pointer
is protected by the ccw_device lock. Resetting the pointer within the
ccw_device structure should be done when holding this lock.
Also resetting the vcdev pointer (under the ccw_device lock) prior to
freeing the vcdev pointer memory removes a critical path.
Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Jens Freimann [Tue, 25 Feb 2014 14:36:45 +0000 (15:36 +0100)]
KVM: s390: get rid of local_int array
We can use kvm_get_vcpu() now and don't need the
local_int array in the floating_int struct anymore.
This also means we don't have to hold the float_int.lock
in some places.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Mon, 27 Jan 2014 11:06:19 +0000 (12:06 +0100)]
KVM: s390: Fixed CC of SIGP SET_PREFIX handler
When SIGP SET_PREFIX is called with an illegal CPU id, it must return
the condition code 3 ("not operational") instead of 1. Also fixed the
order in which the checks are done - CC3 has a higher priority than CC1.
And while we're at it, this patch also get rid of the floating interrupt
lock here by using kvm_get_vcpu() to get the local_int struct of the
destination CPU.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Jens Freimann [Mon, 24 Feb 2014 09:11:41 +0000 (10:11 +0100)]
KVM: s390: Simplify online vcpus counting for stsi
We don't need to loop over all cpus to get the number of
vcpus. Let's use the available counter online_vcpus instead.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Mon, 10 Feb 2014 14:39:23 +0000 (15:39 +0100)]
KVM: s390: expose gbea register to userspace
For migration/reset we want to expose the guest breaking event
address register to userspace. Lets use ONE_REG for that purpose.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Christian Borntraeger [Mon, 10 Feb 2014 14:32:19 +0000 (15:32 +0100)]
KVM: s390: Provide access to program parameter
commit
d208c79d63e06457eef077af770d23dc4cde4d43 (KVM: s390: Enable
the LPP facility for guests) enabled the LPP instruction for guests.
We should expose the program parameter as a pseudo register for
migration/reset etc. Lets also reset this value on initial CPU
reset.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Paolo Bonzini [Thu, 27 Feb 2014 21:54:11 +0000 (22:54 +0100)]
kvm, vmx: Really fix lazy FPU on nested guest
Commit
e504c9098ed6 (kvm, vmx: Fix lazy FPU on nested guest, 2013-11-13)
highlighted a real problem, but the fix was subtly wrong.
nested_read_cr0 is the CR0 as read by L2, but here we want to look at
the CR0 value reflecting L1's setup. In other words, L2 might think
that TS=0 (so nested_read_cr0 has the bit clear); but if L1 is actually
running it with TS=1, we should inject the fault into L1.
The effective value of CR0 in L2 is contained in vmcs12->guest_cr0, use
it.
Fixes:
e504c9098ed6acd9e1079c5e10e4910724ad429f
Reported-by: Kashyap Chamarty <kchamart@redhat.com>
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Kashyap Chamarty <kchamart@redhat.com>
Tested-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Thu, 30 Jan 2014 17:38:33 +0000 (17:38 +0000)]
ARM: KVM: fix warning in mmu.c
Compiling with THP enabled leads to the following warning:
arch/arm/kvm/mmu.c: In function ‘unmap_range’:
arch/arm/kvm/mmu.c:177:39: warning: ‘pte’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (kvm_pmd_huge(*pmd) || page_empty(pte)) {
^
Code inspection reveals that these two cases are mutually exclusive,
so GCC is a bit overzealous here. Silence it anyway by initializing
pte to NULL and testing it later on.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Tue, 14 Jan 2014 18:00:55 +0000 (18:00 +0000)]
ARM: KVM: trap VM system registers until MMU and caches are ON
In order to be able to detect the point where the guest enables
its MMU and caches, trap all the VM related system registers.
Once we see the guest enabling both the MMU and the caches, we
can go back to a saner mode of operation, which is to leave these
registers in complete control of the guest.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Wed, 22 Jan 2014 10:20:09 +0000 (10:20 +0000)]
ARM: KVM: add world-switch for AMAIR{0,1}
HCR.TVM traps (among other things) accesses to AMAIR0 and AMAIR1.
In order to minimise the amount of surprise a guest could generate by
trying to access these registers with caches off, add them to the
list of registers we switch/handle.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Wed, 22 Jan 2014 09:43:38 +0000 (09:43 +0000)]
ARM: KVM: introduce per-vcpu HYP Configuration Register
So far, KVM/ARM used a fixed HCR configuration per guest, except for
the VI/VF/VA bits to control the interrupt in absence of VGIC.
With the upcoming need to dynamically reconfigure trapping, it becomes
necessary to allow the HCR to be changed on a per-vcpu basis.
The fix here is to mimic what KVM/arm64 already does: a per vcpu HCR
field, initialized at setup time.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Tue, 21 Jan 2014 18:56:26 +0000 (18:56 +0000)]
ARM: KVM: fix ordering of 64bit coprocessor accesses
Commit
240e99cbd00a (ARM: KVM: Fix 64-bit coprocessor handling)
added an ordering dependency for the 64bit registers.
The order described is: CRn, CRm, Op1, Op2, 64bit-first.
Unfortunately, the implementation is: CRn, 64bit-first, CRm...
Move the 64bit test to be last in order to match the documentation.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Tue, 21 Jan 2014 18:56:26 +0000 (18:56 +0000)]
ARM: KVM: fix handling of trapped 64bit coprocessor accesses
Commit
240e99cbd00a (ARM: KVM: Fix 64-bit coprocessor handling)
changed the way we match the 64bit coprocessor access from
user space, but didn't update the trap handler for the same
set of registers.
The effect is that a trapped 64bit access is never matched, leading
to a fault being injected into the guest. This went unnoticed as we
didn't really trap any 64bit register so far.
Placing the CRm field of the access into the CRn field of the matching
structure fixes the problem. Also update the debug feature to emit the
expected string in case of failing match.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Tue, 14 Jan 2014 19:13:10 +0000 (19:13 +0000)]
ARM: KVM: force cache clean on page fault when caches are off
In order for a guest with caches disabled to observe data written
contained in a given page, we need to make sure that page is
committed to memory, and not just hanging in the cache (as guest
accesses are completely bypassing the cache until it decides to
enable it).
For this purpose, hook into the coherent_cache_guest_page
function and flush the region if the guest SCTLR
register doesn't show the MMU and caches as being enabled.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Marc Zyngier [Wed, 15 Jan 2014 12:50:23 +0000 (12:50 +0000)]
arm64: KVM: flush VM pages before letting the guest enable caches
When the guest runs with caches disabled (like in an early boot
sequence, for example), all the writes are diectly going to RAM,
bypassing the caches altogether.
Once the MMU and caches are enabled, whatever sits in the cache
becomes suddenly visible, which isn't what the guest expects.
A way to avoid this potential disaster is to invalidate the cache
when the MMU is being turned on. For this, we hook into the SCTLR_EL1
trapping code, and scan the stage-2 page tables, invalidating the
pages/sections that have already been mapped in.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Tue, 18 Feb 2014 14:29:03 +0000 (14:29 +0000)]
ARM: KVM: introduce kvm_p*d_addr_end
The use of p*d_addr_end with stage-2 translation is slightly dodgy,
as the IPA is 40bits, while all the p*d_addr_end helpers are
taking an unsigned long (arm64 is fine with that as unligned long
is 64bit).
The fix is to introduce 64bit clean versions of the same helpers,
and use them in the stage-2 page table code.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Tue, 14 Jan 2014 18:00:55 +0000 (18:00 +0000)]
arm64: KVM: trap VM system registers until MMU and caches are ON
In order to be able to detect the point where the guest enables
its MMU and caches, trap all the VM related system registers.
Once we see the guest enabling both the MMU and the caches, we
can go back to a saner mode of operation, which is to leave these
registers in complete control of the guest.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Tue, 21 Jan 2014 10:55:17 +0000 (10:55 +0000)]
arm64: KVM: allows discrimination of AArch32 sysreg access
The current handling of AArch32 trapping is slightly less than
perfect, as it is not possible (from a handler point of view)
to distinguish it from an AArch64 access, nor to tell a 32bit
from a 64bit access either.
Fix this by introducing two additional flags:
- is_aarch32: true if the access was made in AArch32 mode
- is_32bit: true if is_aarch32 == true and a MCR/MRC instruction
was used to perform the access (as opposed to MCRR/MRRC).
This allows a handler to cover all the possible conditions in which
a system register gets trapped.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Tue, 14 Jan 2014 19:13:10 +0000 (19:13 +0000)]
arm64: KVM: force cache clean on page fault when caches are off
In order for the guest with caches off to observe data written
contained in a given page, we need to make sure that page is
committed to memory, and not just hanging in the cache (as
guest accesses are completely bypassing the cache until it
decides to enable it).
For this purpose, hook into the coherent_icache_guest_page
function and flush the region if the guest SCTLR_EL1
register doesn't show the MMU and caches as being enabled.
The function also get renamed to coherent_cache_guest_page.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Paolo Bonzini [Thu, 27 Feb 2014 21:54:11 +0000 (22:54 +0100)]
kvm, vmx: Really fix lazy FPU on nested guest
Commit
e504c9098ed6 (kvm, vmx: Fix lazy FPU on nested guest, 2013-11-13)
highlighted a real problem, but the fix was subtly wrong.
nested_read_cr0 is the CR0 as read by L2, but here we want to look at
the CR0 value reflecting L1's setup. In other words, L2 might think
that TS=0 (so nested_read_cr0 has the bit clear); but if L1 is actually
running it with TS=1, we should inject the fault into L1.
The effective value of CR0 in L2 is contained in vmcs12->guest_cr0, use
it.
Fixes:
e504c9098ed6acd9e1079c5e10e4910724ad429f
Reported-by: Kashyap Chamarty <kchamart@redhat.com>
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Kashyap Chamarty <kchamart@redhat.com>
Tested-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrew Honig [Thu, 27 Feb 2014 18:35:14 +0000 (19:35 +0100)]
kvm: x86: fix emulator buffer overflow (CVE-2014-0049)
The problem occurs when the guest performs a pusha with the stack
address pointing to an mmio address (or an invalid guest physical
address) to start with, but then extending into an ordinary guest
physical address. When doing repeated emulated pushes
emulator_read_write sets mmio_needed to 1 on the first one. On a
later push when the stack points to regular memory,
mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0.
As a result, KVM exits to userspace, and then returns to
complete_emulated_mmio. In complete_emulated_mmio
vcpu->mmio_cur_fragment is incremented. The termination condition of
vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved.
The code bounces back and fourth to userspace incrementing
mmio_cur_fragment past it's buffer. If the guest does nothing else it
eventually leads to a a crash on a memcpy from invalid memory address.
However if a guest code can cause the vm to be destroyed in another
vcpu with excellent timing, then kvm_clear_async_pf_completion_queue
can be used by the guest to control the data that's pointed to by the
call to cancel_work_item, which can be used to gain execution.
Fixes:
f78146b0f9230765c6315b2e14f56112513389ad
Signed-off-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org (3.5+)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Wed, 26 Feb 2014 18:47:36 +0000 (18:47 +0000)]
arm/arm64: KVM: detect CPU reset on CPU_PM_EXIT
Commit
1fcf7ce0c602 (arm: kvm: implement CPU PM notifier) added
support for CPU power-management, using a cpu_notifier to re-init
KVM on a CPU that entered CPU idle.
The code assumed that a CPU entering idle would actually be powered
off, loosing its state entierely, and would then need to be
reinitialized. It turns out that this is not always the case, and
some HW performs CPU PM without actually killing the core. In this
case, we try to reinitialize KVM while it is still live. It ends up
badly, as reported by Andre Przywara (using a Calxeda Midway):
[ 3.663897] Kernel panic - not syncing: unexpected prefetch abort in Hyp mode at: 0x685760
[ 3.663897] unexpected data abort in Hyp mode at: 0xc067d150
[ 3.663897] unexpected HVC/SVC trap in Hyp mode at: 0xc0901dd0
The trick here is to detect if we've been through a full re-init or
not by looking at HVBAR (VBAR_EL2 on arm64). This involves
implementing the backend for __hyp_get_vectors in the main KVM HYP
code (rather small), and checking the return value against the
default one when the CPU notifier is called on CPU_PM_EXIT.
Reported-by: Andre Przywara <osp@andrep.de>
Tested-by: Andre Przywara <osp@andrep.de>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Rob Herring <rob.herring@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Takuya Yoshikawa [Thu, 27 Feb 2014 06:08:31 +0000 (15:08 +0900)]
KVM: x86: Break kvm_for_each_vcpu loop after finding the VP_INDEX
No need to scan the entire VCPU array.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Michael Mueller [Wed, 26 Feb 2014 15:14:19 +0000 (16:14 +0100)]
KVM/s390: Set preempted flag during vcpu wakeup and interrupt delivery
Commit "kvm: Record the preemption status of vcpus using preempt notifiers"
caused a performance regression on s390. It turned out that in the case that
if a former sleeping cpu, that was woken up, this cpu is not a yield candidate
since it gave up the cpu voluntarily. To retain this candiate its preempted
flag is set during wakeup and interrupt delivery time.
Significant performance measurement work and code analysis to solve this
issue was provided by Mao Chuan Li and his team in Beijing.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Michael Mueller [Wed, 26 Feb 2014 15:14:18 +0000 (16:14 +0100)]
KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop
Use the arch specific function kvm_arch_vcpu_runnable() to add a further
criterium to identify a suitable vcpu to yield to during undirected yield
processing.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Michael Mueller [Wed, 26 Feb 2014 15:14:17 +0000 (16:14 +0100)]
KVM: s390: implementation of kvm_arch_vcpu_runnable()
A vcpu is defined to be runnable if an interrupt is pending.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marcelo Tosatti [Mon, 24 Feb 2014 16:59:32 +0000 (13:59 -0300)]
KVM: MMU: drop read-only large sptes when creating lower level sptes
Read-only large sptes can be created due to read-only faults as
follows:
- QEMU pagetable entry that maps guest memory is read-only
due to COW.
- Guest read faults such memory, COW is not broken, because
it is a read-only fault.
- Enable dirty logging, large spte not nuked because it is read-only.
- Write-fault on such memory causes guest to loop endlessly
(which must go down to level 1 because dirty logging is enabled).
Fix by dropping large spte when necessary.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marcelo Tosatti [Wed, 26 Feb 2014 01:44:54 +0000 (22:44 -0300)]
KVM: x86: emulator_cmpxchg_emulated should mark_page_dirty
emulator_cmpxchg_emulated writes to guest memory, therefore it should
update the dirty bitmap accordingly.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liu, Jinsong [Mon, 24 Feb 2014 10:58:09 +0000 (10:58 +0000)]
KVM: x86: Enable Intel MPX for guest
From
44c2abca2c2eadc6f2f752b66de4acc8131880c4 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:12:31 +0800
Subject: [PATCH v5 3/3] KVM: x86: Enable Intel MPX for guest
This patch enable Intel MPX feature to guest.
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liu, Jinsong [Mon, 24 Feb 2014 10:56:53 +0000 (10:56 +0000)]
KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
From
5d5a80cd172ea6fb51786369bcc23356b1e9e956 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:55 +0800
Subject: [PATCH v5 2/3] KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
Add MSR_IA32_BNDCFGS to msrs_to_save, and corresponding logic
to kvm_get/set_msr().
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liu, Jinsong [Mon, 24 Feb 2014 10:55:46 +0000 (10:55 +0000)]
KVM: x86: Intel MPX vmx and msr handle
From
caddc009a6d2019034af8f2346b2fd37a81608d0 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:11 +0800
Subject: [PATCH v5 1/3] KVM: x86: Intel MPX vmx and msr handle
This patch handle vmx and msr of Intel MPX feature.
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liu, Jinsong [Fri, 21 Feb 2014 17:39:02 +0000 (17:39 +0000)]
KVM: x86: Fix xsave cpuid exposing bug
From
00c920c96127d20d4c3bb790082700ae375c39a0 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Fri, 21 Feb 2014 23:47:18 +0800
Subject: [PATCH] KVM: x86: Fix xsave cpuid exposing bug
EBX of cpuid(0xD, 0) is dynamic per XCR0 features enable/disable.
Bit 63 of XCR0 is reserved for future expansion.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liu, Jinsong [Fri, 21 Feb 2014 17:36:12 +0000 (17:36 +0000)]
KVM: x86: expose ADX feature to guest
From
0750e335eb5860b0b483e217e8a08bd743cbba16 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Thu, 20 Feb 2014 17:39:32 +0800
Subject: [PATCH] KVM: x86: expose ADX feature to guest
ADCX and ADOX instructions perform an unsigned addition with Carry flag and
Overflow flag respectively.
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liu, Jinsong [Fri, 21 Feb 2014 17:33:32 +0000 (17:33 +0000)]
KVM: x86: expose new instruction RDSEED to guest
From
24ffdce9efebf13c6ed4882f714b2b57ef1141eb Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Thu, 20 Feb 2014 17:38:26 +0800
Subject: [PATCH] KVM: x86: expose new instruction RDSEED to guest
RDSEED instruction return a random number, which supplied by a
cryptographically secure, deterministic random bit generator(DRBG).
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fernando Luis Vázquez Cao [Tue, 18 Feb 2014 10:09:11 +0000 (19:09 +0900)]
kvm: remove redundant registration of BSP's hv_clock area
These days hv_clock allocation is memblock based (i.e. the percpu
allocator is not involved), which means that the physical address
of each of the per-cpu hv_clock areas is guaranteed to remain
unchanged through all its lifetime and we do not need to update
its location after CPU bring-up.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Fri, 17 Jan 2014 19:52:42 +0000 (20:52 +0100)]
KVM: SVM: fix NMI window after iret
We should open NMI window right after an iret, but SVM exits before it.
We wanted to single step using the trap flag and then open it.
(or we could emulate the iret instead)
We don't do it since commit
3842d135ff2 (likely), because the iret exit
handler does not request an event, so NMI window remains closed until
the next exit.
Fix this by making KVM_REQ_EVENT request in the iret handler.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Takuya Yoshikawa [Tue, 18 Feb 2014 08:22:47 +0000 (17:22 +0900)]
KVM: Simplify kvm->tlbs_dirty handling
When this was introduced, kvm_flush_remote_tlbs() could be called
without holding mmu_lock. It is now acknowledged that the function
must be called before releasing mmu_lock, and all callers have already
been changed to do so.
There is no need to use smp_mb() and cmpxchg() any more.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>