Xiao Guangrong [Fri, 4 Jun 2010 13:55:29 +0000 (21:55 +0800)]
KVM: MMU: gather remote tlb flush which occurs during page zapped
Using kvm_mmu_prepare_zap_page() and kvm_mmu_zap_page() instead of
kvm_mmu_zap_page() that can reduce remote tlb flush IPI
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Fri, 4 Jun 2010 13:54:38 +0000 (21:54 +0800)]
KVM: MMU: don't get free page number in the loop
In the later patch, we will modify sp's zapping way like below:
kvm_mmu_prepare_zap_page A
kvm_mmu_prepare_zap_page B
kvm_mmu_prepare_zap_page C
....
kvm_mmu_commit_zap_page
[ zaped multiple sps only need to call kvm_mmu_commit_zap_page once ]
In __kvm_mmu_free_some_pages() function, the free page number is
getted form 'vcpu->kvm->arch.n_free_mmu_pages' in loop, it will
hinders us to apply kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page()
since kvm_mmu_prepare_zap_page() not free sp.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Fri, 4 Jun 2010 13:53:54 +0000 (21:53 +0800)]
KVM: MMU: split the operations of kvm_mmu_zap_page()
Using kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() to
split kvm_mmu_zap_page() function, then we can:
- traverse hlist safely
- easily to gather remote tlb flush which occurs during page zapped
Those feature can be used in the later patches
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Fri, 4 Jun 2010 13:53:07 +0000 (21:53 +0800)]
KVM: MMU: introduce some macros to cleanup hlist traverseing
Introduce for_each_gfn_sp() and for_each_gfn_indirect_valid_sp() to
cleanup hlist traverseing
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Fri, 4 Jun 2010 13:52:17 +0000 (21:52 +0800)]
KVM: MMU: skip invalid sp when unprotect page
In kvm_mmu_unprotect_page(), the invalid sp can be skipped
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gui Jianfeng [Fri, 4 Jun 2010 00:51:39 +0000 (08:51 +0800)]
KVM: VMX: Make sure single type invvpid is supported before issuing invvpid instruction
According to SDM, we need check whether single-context INVVPID type is supported
before issuing invvpid instruction.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Lai Jiangshan [Wed, 2 Jun 2010 09:06:03 +0000 (17:06 +0800)]
KVM: x86: use linux/uaccess.h instead of asm/uaccess.h
Should use linux/uaccess.h instead of asm/uaccess.h
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Lai Jiangshan [Wed, 2 Jun 2010 09:01:23 +0000 (17:01 +0800)]
KVM: cleanup "*new.rmap" type
The type of '*new.rmap' is not 'struct page *', fix it
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Sheng Yang [Wed, 2 Jun 2010 06:05:24 +0000 (14:05 +0800)]
KVM: VMX: Enforce EPT pagetable level checking
We only support 4 levels EPT pagetable now.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Glauber Costa [Tue, 1 Jun 2010 12:22:48 +0000 (08:22 -0400)]
KVM: Add Documentation/kvm/msr.txt
This patch adds a file that documents the usage of KVM-specific
MSRs.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Andreas Schwab [Mon, 31 May 2010 19:59:13 +0000 (21:59 +0200)]
KVM: PPC: elide struct thread_struct instances from stack
Instead of instantiating a whole thread_struct on the stack use only the
required parts of it.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mohammed Gamal [Mon, 31 May 2010 19:40:54 +0000 (22:40 +0300)]
KVM: VMX: Properly return error to userspace on vmentry failure
The vmexit handler returns KVM_EXIT_UNKNOWN since there is no handler
for vmentry failures. This intercepts vmentry failures and returns
KVM_FAIL_ENTRY to userspace instead.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gui Jianfeng [Mon, 31 May 2010 09:11:39 +0000 (17:11 +0800)]
KVM: MMU: Don't calculate quadrant if tdp_enabled
There's no need to calculate quadrant if tdp is enabled.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avi Kivity [Thu, 27 May 2010 13:44:12 +0000 (16:44 +0300)]
KVM: MMU: Document large pages
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avi Kivity [Thu, 27 May 2010 11:46:04 +0000 (14:46 +0300)]
KVM: MMU: Document cr0.wp emulation
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 27 May 2010 11:22:51 +0000 (14:22 +0300)]
KVM: MMU: Allow spte.w=1 for gpte.w=0 and cr0.wp=0 only in shadow mode
When tdp is enabled, the guest's cr0.wp shouldn't have any effect on spte
permissions.
Signed-off-by: Avi Kivity <avi@redhat.com>
Jan Kiszka [Tue, 25 May 2010 14:01:50 +0000 (16:01 +0200)]
KVM: x86: Propagate fpu_alloc errors
Memory allocation may fail. Propagate such errors.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Zachary Amsden [Thu, 27 May 2010 01:09:43 +0000 (15:09 -1000)]
KVM: SVM: Fix EFER.LME being stripped
Must set VCPU register to be the guest notion of EFER even if that
setting is not valid on hardware. This was masked by the set in
set_efer until
7657fd5ace88e8092f5f3a84117e093d7b893f26 broke that.
Fix is simply to set the VCPU register before stripping bits.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gui Jianfeng [Thu, 27 May 2010 08:09:48 +0000 (16:09 +0800)]
KVM: MMU: don't check PT_WRITABLE_MASK directly
Since we have is_writable_pte(), make use of it.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Lai Jiangshan [Wed, 26 May 2010 08:48:19 +0000 (16:48 +0800)]
KVM: MMU: calculate correct gfn for small host pages backing large guest pages
In Documentation/kvm/mmu.txt:
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.
But in function FNAME(fetch)(), sp->gfn is incorrect when one of following
situations occurred:
1) guest is 32bit paging and the guest PDE maps a 4-MByte page
(backed by 4k host pages), FNAME(fetch)() miss handling the quadrant.
And if guest use pse-36, "table_gfn = gpte_to_gfn(gw->ptes[level - delta]);"
is incorrect.
2) guest is long mode paging and the guest PDPTE maps a 1-GByte page
(backed by 4k or 2M host pages).
So we fix it to suit to the document and suit to the code which
requires sp->gfn correct when sp->role.direct=1.
We use the goal mapping gfn(gw->gfn) to calculate the base page frame
for linear translations, it is simple and easy to be understood.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Lai Jiangshan [Wed, 26 May 2010 08:48:25 +0000 (16:48 +0800)]
KVM: MMU: Calculate correct base gfn for direct non-DIR level
In Document/kvm/mmu.txt:
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.
But in __direct_map(), the base gfn calculation is incorrect,
it does not calculate correctly when level=3 or 4.
Fix by using PT64_LVL_ADDR_MASK() which accounts for all levels correctly.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Lai Jiangshan [Wed, 26 May 2010 08:49:59 +0000 (16:49 +0800)]
KVM: MMU: Don't allocate gfns page for direct mmu pages
When sp->role.direct is set, sp->gfns does not contain any essential
information, leaf sptes reachable from this sp are for a continuous
guest physical memory range (a linear range).
So sp->gfns[i] (if it was set) equals to sp->gfn + i. (PT_PAGE_TABLE_LEVEL)
Obviously, it is not essential information, we can calculate it when need.
It means we don't need sp->gfns when sp->role.direct=1,
Thus we can save one page usage for every kvm_mmu_page.
Note:
Access to sp->gfns must be wrapped by kvm_mmu_page_get_gfn()
or kvm_mmu_page_set_gfn().
It is only exposed in FNAME(sync_page).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Mohammed Gamal [Sun, 23 May 2010 22:01:04 +0000 (01:01 +0300)]
KVM: VMX: Add constant for invalid guest state exit reason
For the sake of completeness, this patch adds a symbolic
constant for VMX exit reason 0x21 (invalid guest state).
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Mon, 24 May 2010 07:41:33 +0000 (15:41 +0800)]
KVM: MMU: allow more page become unsync at getting sp time
Allow more page become asynchronous at getting sp time, if need create new
shadow page for gfn but it not allow unsync(level > 1), we should unsync all
gfn's unsync page
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Mon, 24 May 2010 07:40:07 +0000 (15:40 +0800)]
KVM: MMU: allow more page become unsync at gfn mapping time
In current code, shadow page can become asynchronous only if one
shadow page for a gfn, this rule is too strict, in fact, we can
let all last mapping page(i.e, it's the pte page) become unsync,
and sync them at invlpg or flush tlb time.
This patch allow more page become asynchronous at gfn mapping time
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Sun, 23 May 2010 15:37:00 +0000 (18:37 +0300)]
KVM: Update Red Hat copyrights
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Sun, 23 May 2010 11:28:26 +0000 (14:28 +0300)]
KVM: SVM: correctly trace irq injection
On SVM interrupts are injected by svm_set_irq() not svm_inject_irq().
The later is used only to wait for irq window.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Sat, 15 May 2010 10:53:35 +0000 (18:53 +0800)]
KVM: MMU: only update unsync page in invlpg path
Only unsync pages need updated at invlpg time since other shadow
pages are write-protected
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Sat, 15 May 2010 10:52:34 +0000 (18:52 +0800)]
KVM: MMU: don't write-protect if have new mapping to unsync page
Two cases maybe happen in kvm_mmu_get_page() function:
- one case is, the goal sp is already in cache, if the sp is unsync,
we only need update it to assure this mapping is valid, but not
mark it sync and not write-protect sp->gfn since it not broke unsync
rule(one shadow page for a gfn)
- another case is, the goal sp not existed, we need create a new sp
for gfn, i.e, gfn (may)has another shadow page, to keep unsync rule,
we should sync(mark sync and write-protect) gfn's unsync shadow page.
After enabling multiple unsync shadows, we sync those shadow pages
only when the new sp not allow to become unsync(also for the unsyc
rule, the new rule is: allow all pte page become unsync)
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Sat, 15 May 2010 10:51:24 +0000 (18:51 +0800)]
KVM: MMU: split kvm_sync_page() function
Split kvm_sync_page() into kvm_sync_page() and kvm_sync_page_transient()
to clarify the code address Avi's suggestion
kvm_sync_page_transient() function only update shadow page but not mark
it sync and not write protect sp->gfn. it will be used by later patch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sheng Yang [Mon, 17 May 2010 09:08:28 +0000 (17:08 +0800)]
KVM: x86: Use FPU API
Convert KVM to use generic FPU API.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sheng Yang [Mon, 17 May 2010 09:08:27 +0000 (17:08 +0800)]
KVM: x86: Use unlazy_fpu() for host FPU
We can avoid unnecessary fpu load when userspace process
didn't use FPU frequently.
Derived from Avi's idea.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sheng Yang [Mon, 17 May 2010 09:22:23 +0000 (17:22 +0800)]
x86: Export FPU API for KVM use
Also add some constants.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 13 May 2010 09:35:17 +0000 (12:35 +0300)]
KVM: Consolidate arch specific vcpu ioctl locking
Now that all arch specific ioctls have centralized locking, it is easy to
move it to the central dispatcher.
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 13 May 2010 09:30:43 +0000 (12:30 +0300)]
KVM: PPC: Centralize locking of arch specific vcpu ioctls
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 13 May 2010 09:21:46 +0000 (12:21 +0300)]
KVM: s390: Centrally lock arch specific vcpu ioctls
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 13 May 2010 08:53:06 +0000 (11:53 +0300)]
KVM: x86: Lock arch specific vcpu ioctls centrally
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 13 May 2010 08:25:04 +0000 (11:25 +0300)]
KVM: move vcpu locking to dispatcher for generic vcpu ioctls
All vcpu ioctls need to be locked, so instead of locking each one specifically
we lock at the generic dispatcher.
This patch only updates generic ioctls and leaves arch specific ioctls alone.
Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Thu, 13 May 2010 02:09:57 +0000 (10:09 +0800)]
KVM: x86: cleanup unused local variable
fix:
arch/x86/kvm/x86.c: In function ‘handle_emulation_failure’:
arch/x86/kvm/x86.c:3844: warning: unused variable ‘ctxt’
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Xiao Guangrong [Thu, 13 May 2010 02:08:08 +0000 (10:08 +0800)]
KVM: MMU: unalias gfn before sp->gfns[] comparison in sync_page
sp->gfns[] contain unaliased gfns, but gpte might contain pointer
to aliased region.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Xiao Guangrong [Thu, 13 May 2010 02:07:00 +0000 (10:07 +0800)]
KVM: MMU: remove rmap before clear spte
Remove rmap before clear spte otherwise it will trigger BUG_ON() in
some functions such as rmap_write_protect().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Xiao Guangrong [Thu, 13 May 2010 02:06:02 +0000 (10:06 +0800)]
KVM: MMU: use proper cache object freeing function
Use kmem_cache_free to free objects allocated by kmem_cache_alloc.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Wed, 12 May 2010 13:46:31 +0000 (09:46 -0400)]
KVM: remove CAP_SYS_RAWIO requirement from kvm_vm_ioctl_assign_irq
Remove this check in an effort to allow kvm guests to run without
root privileges. This capability check doesn't seem to add any
security since the device needs to have already been added via the
assign device ioctl and the io actually occurs through the pci
sysfs interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Sheng Yang [Wed, 12 May 2010 08:40:42 +0000 (16:40 +0800)]
KVM: VMX: Only reset MMU when necessary
Only modifying some bits of CR0/CR4 needs paging mode switch.
Modify EFER.NXE bit would result in reserved bit updates.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Sheng Yang [Wed, 12 May 2010 08:40:41 +0000 (16:40 +0800)]
KVM: x86: Clean up duplicate assignment
mmu.free() already set root_hpa to INVALID_PAGE, no need to do it again in the
destory_kvm_mmu().
kvm_x86_ops->set_cr4() and set_efer() already assign cr4/efer to
vcpu->arch.cr4/efer, no need to do it again later.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mohammed Gamal [Tue, 11 May 2010 22:39:22 +0000 (01:39 +0300)]
KVM: x86 emulator: Add missing decoder flags for xor instructions
This adds missing decoder flags for xor instructions (opcodes 0x34 - 0x35)
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mohammed Gamal [Tue, 11 May 2010 22:39:21 +0000 (01:39 +0300)]
KVM: x86 emulator: Add missing decoder flags for sub instruction
This adds missing decoder flags for sub instructions (opcodes 0x2c - 0x2d)
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mohammed Gamal [Tue, 11 May 2010 19:22:40 +0000 (22:22 +0300)]
KVM: x86 emulator: Add test acc, imm instruction (opcodes 0xA8 - 0xA9)
This adds test acc, imm instruction to the x86 emulator
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Marcelo Tosatti [Thu, 13 May 2010 00:00:35 +0000 (21:00 -0300)]
KVM: pass correct parameter to kvm_mmu_free_some_pages
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Dongxiao Xu [Tue, 11 May 2010 10:29:48 +0000 (18:29 +0800)]
KVM: VMX: VMXON/VMXOFF usage changes
SDM suggests VMXON should be called before VMPTRLD, and VMXOFF
should be called after doing VMCLEAR.
Therefore in vmm coexistence case, we should firstly call VMXON
before any VMCS operation, and then call VMXOFF after the
operation is done.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Dongxiao Xu [Tue, 11 May 2010 10:29:45 +0000 (18:29 +0800)]
KVM: VMX: VMCLEAR/VMPTRLD usage changes
Originally VMCLEAR/VMPTRLD is called on vcpu migration. To
support hosted VMM coexistance, VMCLEAR is executed on vcpu
schedule out, and VMPTRLD is executed on vcpu schedule in.
This could also eliminate the IPI when doing VMCLEAR.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Dongxiao Xu [Tue, 11 May 2010 10:29:42 +0000 (18:29 +0800)]
KVM: VMX: Some minor changes to code structure
Do some preparations for vmm coexistence support.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Dongxiao Xu [Tue, 11 May 2010 10:29:38 +0000 (18:29 +0800)]
KVM: VMX: Define new functions to wrapper direct call of asm code
Define vmcs_load() and kvm_cpu_vmxon() to avoid direct call of asm
code. Also move VMXE bit operation out of kvm_cpu_vmxoff().
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gui Jianfeng [Tue, 11 May 2010 06:36:58 +0000 (14:36 +0800)]
KVM: update mmu documetation for role.nxe
There's no member "cr4_nxe" in struct kvm_mmu_page_role, it names "nxe" now.
Update mmu document.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avi Kivity [Mon, 10 May 2010 09:09:56 +0000 (12:09 +0300)]
KVM: MMU: Fix free memory accounting race in mmu_alloc_roots()
We drop the mmu lock between freeing memory and allocating the roots; this
allows some other vcpu to sneak in and allocate memory.
While the race is benign (resulting only in temporary overallocation, not oom)
it is simple and easy to fix by moving the freeing close to the allocation.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gleb Natapov [Mon, 10 May 2010 08:16:56 +0000 (11:16 +0300)]
KVM: inject #UD if instruction emulation fails and exit to userspace
Do not kill VM when instruction emulation fails. Inject #UD and report
failure to userspace instead. Userspace may choose to reenter guest if
vcpu is in userspace (cpl == 3) in which case guest OS will kill
offending process and continue running.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avi Kivity [Thu, 29 Apr 2010 09:12:57 +0000 (12:12 +0300)]
KVM: Document KVM_SET_BOOT_CPU_ID
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 29 Apr 2010 09:08:56 +0000 (12:08 +0300)]
KVM: Document KVM_SET_IDENTITY_MAP ioctl
Signed-off-by: Avi Kivity <avi@redhat.com>
Gui Jianfeng [Wed, 5 May 2010 01:03:49 +0000 (09:03 +0800)]
KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed
Currently, kvm_mmu_zap_page() returning the number of freed children sp.
This might confuse the caller, because caller don't know the actual freed
number. Let's make kvm_mmu_zap_page() return the number of pages it actually
freed.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gui Jianfeng [Wed, 5 May 2010 01:58:33 +0000 (09:58 +0800)]
KVM: MMU: Fix debug output error in walk_addr()
Fix a debug output error in walk_addr
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gui Jianfeng [Wed, 5 May 2010 01:09:21 +0000 (09:09 +0800)]
KVM: MMU: mark page table dirty when a pte is actually modified
Sometime cmpxchg_gpte doesn't modify gpte, in such case, don't mark
page table page as dirty.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Joerg Roedel [Wed, 5 May 2010 14:04:44 +0000 (16:04 +0200)]
KVM: SVM: Allow EFER.LMSLE to be set with nested svm
This patch enables setting of efer bit 13 which is allowed
in all SVM capable processors. This is necessary for the
SLES11 version of Xen 4.0 to boot with nested svm.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Joerg Roedel [Wed, 5 May 2010 14:04:42 +0000 (16:04 +0200)]
KVM: SVM: Dump vmcb contents on failed vmrun
This patch adds a function to dump the vmcb into the kernel
log and calls it after a failed vmrun to ease debugging.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Mon, 3 May 2010 13:54:48 +0000 (16:54 +0300)]
KVM: Get rid of KVM_REQ_KICK
KVM_REQ_KICK poisons vcpu->requests by having a bit set during normal
operation. This causes the fast path check for a clear vcpu->requests
to fail all the time, triggering tons of atomic operations.
Fix by replacing KVM_REQ_KICK with a vcpu->guest_mode atomic.
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:44 +0000 (19:15 +0300)]
KVM: x86 emulator: do not inject exception directly into vcpu
Return exception as a result of instruction emulation and handle
injection in KVM code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:43 +0000 (19:15 +0300)]
KVM: x86 emulator: move interruptibility state tracking out of emulator
Emulator shouldn't access vcpu directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:42 +0000 (19:15 +0300)]
KVM: x86 emulator: handle shadowed registers outside emulator
Emulator shouldn't access vcpu directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:41 +0000 (19:15 +0300)]
KVM: x86 emulator: use shadowed register in emulate_sysexit()
emulate_sysexit() should use shadowed registers copy instead of
looking into vcpu state directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:40 +0000 (19:15 +0300)]
KVM: x86 emulator: set RFLAGS outside x86 emulator code
Removes the need for set_flags() callback.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:39 +0000 (19:15 +0300)]
KVM: x86 emulator: advance RIP outside x86 emulator code
Return new RIP as part of instruction emulation result instead of
updating KVM's RIP from x86 emulator code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:38 +0000 (19:15 +0300)]
KVM: handle emulation failure case first
If emulation failed return immediately.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:37 +0000 (19:15 +0300)]
KVM: do not inject #PF in (read|write)_emulated() callbacks
Return error to x86 emulator instead of injection exception behind its back.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:36 +0000 (19:15 +0300)]
KVM: remove export of emulator_write_emulated()
It is not called directly outside of the file it's defined in anymore.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:35 +0000 (19:15 +0300)]
KVM: x86 emulator: x86_emulate_insn() return -1 only in case of emulation failure
Currently emulator returns -1 when emulation failed or IO is needed.
Caller tries to guess whether emulation failed by looking at other
variables. Make it easier for caller to recognise error condition by
always returning -1 in case of failure. For this new emulator
internal return value X86EMUL_IO_NEEDED is introduced. It is used to
distinguish between error condition (which returns X86EMUL_UNHANDLEABLE)
and condition that requires IO exit to userspace to continue emulation.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:34 +0000 (19:15 +0300)]
KVM: fill in run->mmio details in (read|write)_emulated function
Fill in run->mmio details in (read|write)_emulated function just like
pio does. There is no point in filling only vcpu fields there just to
copy them into vcpu->run a little bit later.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:33 +0000 (19:15 +0300)]
KVM: x86 emulator: fix X86EMUL_RETRY_INSTR and X86EMUL_CMPXCHG_FAILED values
Currently X86EMUL_PROPAGATE_FAULT, X86EMUL_RETRY_INSTR and
X86EMUL_CMPXCHG_FAILED have the same value so caller cannot
distinguish why function such as emulator_cmpxchg_emulated()
(which can return both X86EMUL_PROPAGATE_FAULT and
X86EMUL_CMPXCHG_FAILED) failed.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:32 +0000 (19:15 +0300)]
KVM: x86 emulator: make (get|set)_dr() callback return error if it fails
Make (get|set)_dr() callback return error if it fails instead of
injecting exception behind emulator's back.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:31 +0000 (19:15 +0300)]
KVM: x86 emulator: make set_cr() callback return error if it fails
Make set_cr() callback return error if it fails instead of injecting #GP
behind emulator's back.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:30 +0000 (19:15 +0300)]
KVM: x86 emulator: cleanup some direct calls into kvm to use existing callbacks
Use callbacks from x86_emulate_ops to access segments instead of calling
into kvm directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:29 +0000 (19:15 +0300)]
KVM: x86 emulator: add get_cached_segment_base() callback to x86_emulate_ops
On VMX it is expensive to call get_cached_descriptor() just to get segment
base since multiple vmcs_reads are done instead of only one. Introduce
new call back get_cached_segment_base() for efficiency.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:28 +0000 (19:15 +0300)]
KVM: x86 emulator: add (set|get)_msr callbacks to x86_emulate_ops
Add (set|get)_msr callbacks to x86_emulate_ops instead of calling
them directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:27 +0000 (19:15 +0300)]
KVM: x86 emulator: add (set|get)_dr callbacks to x86_emulate_ops
Add (set|get)_dr callbacks to x86_emulate_ops instead of calling
them directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:26 +0000 (19:15 +0300)]
KVM: x86 emulator: handle "far address" source operand
ljmp/lcall instruction operand contains address and segment.
It can be 10 bytes long. Currently we decode it as two different
operands. Fix it by introducing new kind of operand that can hold
entire far address.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:25 +0000 (19:15 +0300)]
KVM: x86 emulator: cleanup nop emulation
Make it more explicit what we are checking for.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:24 +0000 (19:15 +0300)]
KVM: x86 emulator: cleanup xchg emulation
Dst operand is already initialized during decoding stage. No need to
reinitialize.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:23 +0000 (19:15 +0300)]
KVM: x86 emulator: fix Move r/m16 to segment register decoding
This instruction does not need generic decoding for its dst operand.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Wed, 28 Apr 2010 16:15:22 +0000 (19:15 +0300)]
KVM: x86 emulator: introduce read cache
Introduce read cache which is needed for instruction that require more
then one exit to userspace. After returning from userspace the instruction
will be re-executed with cached read value.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Mon, 3 May 2010 13:05:44 +0000 (16:05 +0300)]
KVM: VMX: Avoid writing HOST_CR0 every entry
cr0.ts may change between entries, so we copy cr0 to HOST_CR0 before each
entry. That is slow, so instead, set HOST_CR0 to have TS set unconditionally
(which is a safe value), and issue a clts() just before exiting vcpu context
if the task indeed owns the fpu.
Saves ~50 cycles/exit.
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Tue, 4 May 2010 10:00:55 +0000 (13:00 +0300)]
KVM: kvm_pdptr_read() may sleep
Annotate it thusly.
Signed-off-by: Avi Kivity <avi@redhat.com>
Takuya Yoshikawa [Wed, 28 Apr 2010 09:50:36 +0000 (18:50 +0900)]
KVM: x86: avoid unnecessary bitmap allocation when memslot is clean
Although we always allocate a new dirty bitmap in x86's get_dirty_log(),
it is only used as a zero-source of copy_to_user() and freed right after
that when memslot is clean. This patch uses clear_user() instead of doing
this unnecessary zero-source allocation.
Performance improvement: as we can expect easily, the time needed to
allocate a bitmap is completely reduced. In my test, the improved ioctl
was about 4 to 10 times faster than the original one for clean slots.
Furthermore, reducing memory allocations and copies will produce good
effects to caches too.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Tue, 4 May 2010 09:24:12 +0000 (12:24 +0300)]
KVM: VMX: Simplify vmx_get_nmi_mask()
!! is not needed due to the cast to bool.
Signed-off-by: Avi Kivity <avi@redhat.com>
Huang Ying [Mon, 31 May 2010 06:28:19 +0000 (14:28 +0800)]
KVM: Avoid killing userspace through guest SRAO MCE on unmapped pages
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Linus Torvalds [Thu, 29 Jul 2010 03:01:26 +0000 (20:01 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
x86,kgdb: Fix hw breakpoint regression
Linus Torvalds [Thu, 29 Jul 2010 03:00:42 +0000 (20:00 -0700)]
Merge git://git./linux/kernel/git/jejb/scsi-rc-fixes-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] ibmvscsi: Fix oops when an interrupt is pending during probe
[SCSI] zfcp: Update status read mempool
[SCSI] zfcp: Do not wait for SBALs on stopped queue
[SCSI] zfcp: Fix check whether unchained ct_els is possible
[SCSI] ipr: fix resource path display and formatting
Linus Torvalds [Thu, 29 Jul 2010 02:59:55 +0000 (19:59 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/lrg/voltage-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6:
davinci: da850/omap-l138 evm: account for
DEFDCDC{2,3} being tied high
regulator: tps6507x: allow driver to use
DEFDCDC{2,3}_HIGH register
wm8350-regulator: fix wm8350_register_regulator error handling
ab3100: fix off-by-one value range checking for voltage selector
Andre Osterhues [Tue, 13 Jul 2010 20:59:17 +0000 (15:59 -0500)]
ecryptfs: Bugfix for error related to ecryptfs_hash_buckets
The function ecryptfs_uid_hash wrongly assumes that the
second parameter to hash_long() is the number of hash
buckets instead of the number of hash bits.
This patch fixes that and renames the variable
ecryptfs_hash_buckets to ecryptfs_hash_bits to make it
clearer.
Fixes: CVE-2010-2492
Signed-off-by: Andre Osterhues <aosterhues@escrypt.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jason Wessel [Thu, 29 Jul 2010 00:10:30 +0000 (19:10 -0500)]
x86,kgdb: Fix hw breakpoint regression
HW breakpoints events stopped working correctly with kgdb
as a result of commit:
018cbffe6819f6f8db20a0a3acd9bab9bfd667e4
(Merge commit 'v2.6.33' into perf/core).
The regression occurred because the behavior changed for setting
NOTIFY_STOP as the return value to the die notifier if the breakpoint
was known to the HW breakpoint API. Because kgdb is using the HW
breakpoint API to register HW breakpoints slots, it must also now
implement the overflow_handler call back else kgdb does not get to see
the events from the die notifier.
The kgdb_ll_trap function will be changed to be general purpose code
which can allow an easy way to implement the hw_breakpoint API
overflow call back.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Dongdong Deng <dongdong.deng@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Linus Torvalds [Wed, 28 Jul 2010 18:10:53 +0000 (11:10 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: use complete_all and wake_up_all
ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
ceph: fix dentry lease release
ceph: fix leak of dentry in ceph_init_dentry() error path
ceph: fix pg_mapping leak on pg_temp updates
ceph: fix d_release dop for snapdir, snapped dentries
ceph: avoid dcache readdir for snapdir
Steven Whitehouse [Wed, 28 Jul 2010 16:56:23 +0000 (17:56 +0100)]
GFS2: Use kmalloc when possible for ->readdir()
If we don't need a huge amount of memory in ->readdir() then
we can use kmalloc rather than vmalloc to allocate it. This
should cut down on the greater overheads associated with
vmalloc for smaller directories.
We may be able to eliminate vmalloc entirely at some stage,
but this is easy to do right away.
Also using GFP_NOFS to avoid any issues wrt to deleting inodes
while under a glock, and suggestion from Linus to factor out
the alloc/dealloc.
I've given this a test with a variety of different sized
directories and it seems to work ok.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sekhar Nori [Mon, 12 Jul 2010 12:26:21 +0000 (17:56 +0530)]
davinci: da850/omap-l138 evm: account for
DEFDCDC{2,3} being tied high
Per the da850/omap-l138 Beta EVM SOM schematic, the
DEFDCDC2 and
DEFDCDC3 lines are tied high. This leads to a 3.3V IO and 1.2V CVDD
voltage.
Pass the right platform data to the TPS6507x driver so it can operate
on the
DEFDCDC{2,3}_HIGH register to read and change voltage levels.
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>