.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:
- cpus_read_lock() is taken outside kvm_lock

- kvm_usage_lock is taken outside cpus_read_lock()

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side when modifying memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.
cpus_read_lock() vs kvm_lock:

- Taking cpus_read_lock() outside of kvm_lock is problematic, despite that
  being the official ordering, as it is quite easy to unknowingly trigger
  cpus_read_lock() while holding kvm_lock.  Use caution when walking vm_list,
  e.g. avoid complex operations when possible; a safe pattern is sketched
  below.
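For instance, if the per-VM work may itself take cpus_read_lock()
internally (e.g. to update a static key), the walk must take
cpus_read_lock() first.  A minimal sketch, assuming a hypothetical
kvm_do_simple_work() helper::

	struct kvm *kvm;

	cpus_read_lock();
	mutex_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list)
		kvm_do_simple_work(kvm);  /* hypothetical, must stay simple */
	mutex_unlock(&kvm_lock);
	cpus_read_unlock();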
For SRCU:

- ``synchronize_srcu(&kvm->srcu)`` is called inside critical sections
  for kvm->lock, vcpu->mutex and kvm->slots_lock.  These locks _cannot_
  be taken inside a kvm->srcu read-side critical section; that is, the
  following is broken::

      srcu_read_lock(&kvm->srcu);
      mutex_lock(&kvm->slots_lock);

- kvm->slots_arch_lock instead is released before the call to
  ``synchronize_srcu()``.  It _can_ therefore be taken inside a
  kvm->srcu read-side critical section, for example while processing
  a vmexit, as sketched below.
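By contrast with the broken pattern above, the following is fine (a
minimal sketch; the work done under the lock is elided)::

	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	mutex_lock(&kvm->slots_arch_lock);
	/* ... modify arch-specific fields of the memslots ... */
	mutex_unlock(&kvm->slots_arch_lock);
	srcu_read_unlock(&kvm->srcu, idx);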
On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock

- kvm->arch.mmu_lock is an rwlock; critical sections for
  kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
  also take kvm->arch.mmu_lock

Everything else is a leaf: no other lock is taken inside the critical
sections.
2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86.  Currently, the page fault can be fast in one of the
following two cases:
1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking.  That means we need to restore the saved R/X bits.  This is
   described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect.  That means we just need to change the W bit of the spte.
What we use to avoid all the races is the Host-writable bit and MMU-writable bit
on the spte:

- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.

- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.
On the fast page fault path, we will use a cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
R/X bits for an access-tracked spte, or both.  This is safe because any
concurrent change to these bits is detected by the cmpxchg.
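A minimal sketch of the cmpxchg step (field and helper names are
illustrative, not the actual KVM implementation)::

	u64 old_spte = READ_ONCE(*sptep);
	u64 new_spte = old_spte | PT_WRITABLE_MASK;

	/*
	 * If any other CPU modified the spte in the meantime, the cmpxchg
	 * fails and the fast path retries or falls back to the slow path
	 * under mmu_lock.
	 */
	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
		return false;	/* lost the race, not fixed here */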
But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg.  This is an ABA problem; for example, the case
below can happen:
+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|   gpte = gfn1                                                          |
|   gfn1 is mapped to pfn1 on host                                       |
|   spte is the shadow page table entry corresponding with gpte and      |
|   spte = pfn1                                                          |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |   spte = 0;                       |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |   spte = pfn1;                    |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W)                              |
|      mark_page_dirty(vcpu->kvm, gfn1)                                  |
|        OOPS!!!                                                         |
+------------------------------------------------------------------------+
We dirty-log for gfn1, which means gfn2 is lost in the dirty-bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn.  For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
gfn_to_pfn_memslot_atomic, before the cmpxchg.  After the pinning:
- We have held the refcount of pfn; that means the pfn can not be freed and
  be reused for another gfn.

- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn, as sketched
below.
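A minimal sketch of that pin-then-cmpxchg approach (hypothetical; as noted
above, KVM simply disables the fast path for indirect sps instead)::

	kvm_pfn_t pfn;

	/* Pin the pfn so it cannot be freed and reused for another gfn. */
	pfn = gfn_to_pfn_memslot_atomic(slot, gfn);
	if (is_error_pfn(pfn))
		return false;

	if (cmpxchg64(sptep, old_spte, old_spte | PT_WRITABLE_MASK) == old_spte)
		mark_page_dirty(vcpu->kvm, gfn);

	kvm_release_pfn_clean(pfn);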
2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit can not be lost.

But it is not true after fast page fault since the spte can be marked
writable between reading spte and updating spte, as in the case below:
+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|   spte.W = 0                                                           |
|   spte.Accessed = 1                                                    |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits():: |                                     |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
|                                    |                                   |
|                                    |                                   |
| /* 'if' condition is satisfied. */ |                                   |
| if (old_spte.Accessed == 1 &&      |                                   |
|      old_spte.W == 0)              |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |   spte.W = 1                      |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |   spte.Dirty = 1                  |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   else                             |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|   if (old_spte.Accessed == 1)      |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|   if (old_spte.Dirty == 1)         |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+
The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means
the spte is always atomically updated in this case, as sketched below.
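A minimal sketch of the resulting update policy (the helper name is modeled
on, but not identical to, the actual KVM code; sptep and new_spte come from
the surrounding update path)::

	u64 old_spte;

	if (!spte_has_volatile_bits_example(*sptep)) {	/* hypothetical */
		/* Cannot change under us: a plain write is enough. */
		old_spte = *sptep;
		WRITE_ONCE(*sptep, new_spte);
	} else {
		/* May be changed locklessly: must update atomically. */
		old_spte = xchg(sptep, new_spte);
	}
	/* Accessed/Dirty bits in old_spte are now safe to propagate. */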
3) flush tlbs due to spte updated

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path.  In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is the common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte and the race caused by fast page fault can be
avoided.  See the comments in spte_has_volatile_bits() and mmu_spte_update().
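A minimal sketch of the flush decision (simplified; the real check in
mmu_spte_update() handles more cases)::

	bool flush = false;

	/* writable -> read-only: a stale writable entry may be in a TLB. */
	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
		flush = true;

	if (flush)
		kvm_flush_remote_tlbs(kvm);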
Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits.  In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits.  When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state.  The W bit is not saved when the
PTE is marked for access tracking and during restoration to the Present state,
the W bit is set depending on whether or not it was a write access.  If it
wasn't, then the W bit will remain clear until a write access happens, at which
time it will be set using the Dirty tracking mechanism described above.
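A minimal sketch of the mark/restore steps (the masks and shift below are
illustrative, not the actual KVM definitions)::

	#define RX_MASK		0x5ull	/* EPT R and X bits (illustrative) */
	#define RWX_MASK	0x7ull	/* EPT R, W and X bits (illustrative) */
	#define SAVED_SHIFT	54	/* an ignored-bit range (illustrative) */

	/* Mark: stash R/X in ignored bits, clear RWX so the PTE faults. */
	static u64 mark_spte_for_access_track_example(u64 spte)
	{
		spte |= (spte & RX_MASK) << SAVED_SHIFT;
		return spte & ~RWX_MASK;
	}

	/* Restore: bring back R/X; set W only if the fault was a write. */
	static u64 restore_acc_track_spte_example(u64 spte, bool write_fault)
	{
		spte |= (spte >> SAVED_SHIFT) & RX_MASK;
		spte &= ~(RX_MASK << SAVED_SHIFT);
		if (write_fault)
			spte |= 0x2ull;	/* EPT W bit */
		return spte;
	}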
3. Reference
------------

``kvm_lock``
^^^^^^^^^^^^

:Type: mutex
:Arch: any
:Protects: - vm_list
``kvm_usage_lock``
^^^^^^^^^^^^^^^^^^

:Type: mutex
:Arch: any
:Protects: - kvm_usage_count
           - hardware virtualization enable/disable
:Comment: Exists to allow taking cpus_read_lock() while kvm_usage_count is
          protected, which simplifies the virtualization enabling logic.
``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type: spinlock_t
:Arch: any
:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait
``kvm_arch::tsc_write_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type: raw_spinlock_t
:Arch: x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment: 'raw' because updating the tsc offsets must not be preempted.
``kvm->mmu_lock``
^^^^^^^^^^^^^^^^^

:Type: spinlock_t or rwlock_t
:Arch: any
:Protects: - shadow page/shadow tlb entry
:Comment: It is a spinlock since it is used in MMU notifiers.
``kvm->srcu``
^^^^^^^^^^^^^

:Type: srcu lock
:Arch: any
:Protects: - kvm->memslots
           - kvm->buses
:Comment: The srcu read lock must be held while accessing memslots (e.g.
          when using gfn_to_* functions) and while accessing in-kernel
          MMIO/PIO address->device structure mapping (kvm->buses).
          The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
          if it is needed by multiple functions.
``kvm->slots_arch_lock``
^^^^^^^^^^^^^^^^^^^^^^^^

:Type: mutex
:Arch: any (only needed on x86 though)
:Protects: any arch-specific fields of memslots that have to be modified
           in a ``kvm->srcu`` read-side critical section.
:Comment: must be held before reading the pointer to the current memslots,
          until after all changes to the memslots are complete; a sketch of
          the protocol follows.
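A minimal sketch of that protocol, heavily simplified from the memslot
update path (``new_slots`` and the elided bookkeeping are illustrative)::

	mutex_lock(&kvm->slots_arch_lock);
	/* Safe to read the current memslots pointer from here on. */
	rcu_assign_pointer(kvm->memslots[as_id], new_slots);
	mutex_unlock(&kvm->slots_arch_lock);

	/*
	 * Per the SRCU rules in section 1, the lock is dropped before
	 * waiting for readers of the old memslots to finish.
	 */
	synchronize_srcu(&kvm->srcu);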
``wakeup_vcpus_on_cpu_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type: spinlock_t
:Arch: x86
:Protects: wakeup_vcpus_on_cpu
:Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
          When VT-d posted-interrupts are supported and the VM has assigned
          devices, we put the blocked vCPU on the wakeup_vcpus_on_cpu list,
          protected by this lock.  When the VT-d hardware issues a wakeup
          notification event because an external interrupt from an assigned
          device arrived, we find the vCPU on the list and wake it up.
``vendor_module_lock``
^^^^^^^^^^^^^^^^^^^^^^

:Type: mutex
:Arch: x86
:Protects: loading a vendor module (kvm_amd or kvm_intel)
:Comment: Exists because using kvm_lock leads to deadlock.  kvm_lock is taken
          in notifiers, e.g. __kvmclock_cpufreq_notifier(), that may be invoked
          while cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(),
          and many operations need to take cpu_hotplug_lock when loading a vendor
          module, e.g. updating static calls.