| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298 |
- =============================
- Examining Process Page Tables
- =============================
- pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
- userspace programs to examine the page tables and related information by
- reading files in ``/proc``.
- There are four components to pagemap:
- * ``/proc/pid/pagemap``. This file lets a userspace process find out which
- physical frame each virtual page is mapped to. It contains one 64-bit
- value for each virtual page, containing the following data (from
- ``fs/proc/task_mmu.c``, above pagemap_read):
- * Bits 0-54 page frame number (PFN) if present
- * Bits 0-4 swap type if swapped
- * Bits 5-54 swap offset if swapped
- * Bit 55 pte is soft-dirty (see
- Documentation/admin-guide/mm/soft-dirty.rst)
- * Bit 56 page exclusively mapped (since 4.2)
- * Bit 57 pte is uffd-wp write-protected (since 5.13) (see
- Documentation/admin-guide/mm/userfaultfd.rst)
- * Bits 58-60 zero
- * Bit 61 page is file-page or shared-anon (since 3.5)
- * Bit 62 page swapped
- * Bit 63 page present
- Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
- In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from
- 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
- Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
- If the page is not present but in swap, then the PFN contains an
- encoding of the swap file number and the page's offset into the
- swap. Unmapped pages return a null PFN. This allows determining
- precisely which pages are mapped (or in swap) and comparing mapped
- pages between processes.
- Efficient users of this interface will use ``/proc/pid/maps`` to
- determine which areas of memory are actually mapped and llseek to
- skip over unmapped regions.
- * ``/proc/kpagecount``. This file contains a 64-bit count of the number of
- times each page is mapped, indexed by PFN.
- The page-types tool in the tools/mm directory can be used to query the
- number of times a page is mapped.
- * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
- page, indexed by PFN.
- The flags are (from ``fs/proc/page.c``, above kpageflags_read):
- 0. LOCKED
- 1. ERROR
- 2. REFERENCED
- 3. UPTODATE
- 4. DIRTY
- 5. LRU
- 6. ACTIVE
- 7. SLAB
- 8. WRITEBACK
- 9. RECLAIM
- 10. BUDDY
- 11. MMAP
- 12. ANON
- 13. SWAPCACHE
- 14. SWAPBACKED
- 15. COMPOUND_HEAD
- 16. COMPOUND_TAIL
- 17. HUGE
- 18. UNEVICTABLE
- 19. HWPOISON
- 20. NOPAGE
- 21. KSM
- 22. THP
- 23. OFFLINE
- 24. ZERO_PAGE
- 25. IDLE
- 26. PGTABLE
- * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
- memory cgroup each page is charged to, indexed by PFN. Only available when
- CONFIG_MEMCG is set.
- Short descriptions to the page flags
- ====================================
- 0 - LOCKED
- The page is being locked for exclusive access, e.g. by undergoing read/write
- IO.
- 7 - SLAB
- The page is managed by the SLAB/SLUB kernel memory allocator.
- When compound page is used, either will only set this flag on the head
- page.
- 10 - BUDDY
- A free memory block managed by the buddy system allocator.
- The buddy system organizes free memory in blocks of various orders.
- An order N block has 2^N physically contiguous pages, with the BUDDY flag
- set for and _only_ for the first page.
- 15 - COMPOUND_HEAD
- A compound page with order N consists of 2^N physically contiguous pages.
- A compound page with order 2 takes the form of "HTTT", where H donates its
- head page and T donates its tail page(s). The major consumers of compound
- pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst),
- the SLUB etc. memory allocators and various device drivers.
- However in this interface, only huge/giga pages are made visible
- to end users.
- 16 - COMPOUND_TAIL
- A compound page tail (see description above).
- 17 - HUGE
- This is an integral part of a HugeTLB page.
- 19 - HWPOISON
- Hardware detected memory corruption on this page: don't touch the data!
- 20 - NOPAGE
- No page frame exists at the requested address.
- 21 - KSM
- Identical memory pages dynamically shared between one or more processes.
- 22 - THP
- Contiguous pages which construct THP of any size and mapped by any granularity.
- 23 - OFFLINE
- The page is logically offline.
- 24 - ZERO_PAGE
- Zero page for pfn_zero or huge_zero page.
- 25 - IDLE
- The page has not been accessed since it was marked idle (see
- Documentation/admin-guide/mm/idle_page_tracking.rst).
- Note that this flag may be stale in case the page was accessed via
- a PTE. To make sure the flag is up-to-date one has to read
- ``/sys/kernel/mm/page_idle/bitmap`` first.
- 26 - PGTABLE
- The page is in use as a page table.
- IO related page flags
- ---------------------
- 1 - ERROR
- IO error occurred.
- 3 - UPTODATE
- The page has up-to-date data.
- ie. for file backed page: (in-memory data revision >= on-disk one)
- 4 - DIRTY
- The page has been written to, hence contains new data.
- i.e. for file backed page: (in-memory data revision > on-disk one)
- 8 - WRITEBACK
- The page is being synced to disk.
- LRU related page flags
- ----------------------
- 5 - LRU
- The page is in one of the LRU lists.
- 6 - ACTIVE
- The page is in the active LRU list.
- 18 - UNEVICTABLE
- The page is in the unevictable (non-)LRU list It is somehow pinned and
- not a candidate for LRU page reclaims, e.g. ramfs pages,
- shmctl(SHM_LOCK) and mlock() memory segments.
- 2 - REFERENCED
- The page has been referenced since last LRU list enqueue/requeue.
- 9 - RECLAIM
- The page will be reclaimed soon after its pageout IO completed.
- 11 - MMAP
- A memory mapped page.
- 12 - ANON
- A memory mapped page that is not part of a file.
- 13 - SWAPCACHE
- The page is mapped to swap space, i.e. has an associated swap entry.
- 14 - SWAPBACKED
- The page is backed by swap/RAM.
- The page-types tool in the tools/mm directory can be used to query the
- above flags.
- Exceptions for Shared Memory
- ============================
- Page table entries for shared pages are cleared when the pages are zapped or
- swapped out. This makes swapped out pages indistinguishable from never-allocated
- ones.
- In kernel space, the swap location can still be retrieved from the page cache.
- However, values stored only on the normal PTE get lost irretrievably when the
- page is swapped out (i.e. SOFT_DIRTY).
- In user space, whether the page is present, swapped or none can be deduced with
- the help of lseek and/or mincore system calls.
- lseek() can differentiate between accessed pages (present or swapped out) and
- holes (none/non-allocated) by specifying the SEEK_DATA flag on the file where
- the pages are backed. For anonymous shared pages, the file can be found in
- ``/proc/pid/map_files/``.
- mincore() can differentiate between pages in memory (present, including swap
- cache) and out of memory (swapped out or none/non-allocated).
- Other notes
- ===========
- Reading from any of the files will return -EINVAL if you are not starting
- the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
- into the file), or if the size of the read is not a multiple of 8 bytes.
- Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
- always 12 at most architectures). Since Linux 3.11 their meaning changes
- after first clear of soft-dirty bits. Since Linux 4.2 they are used for
- flags unconditionally.
- Pagemap Scan IOCTL
- ==================
- The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
- clear the info about page table entries. The following operations are supported
- in this IOCTL:
- - Scan the address range and get the memory ranges matching the provided criteria.
- This is performed when the output buffer is specified.
- - Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect
- the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if
- non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be
- used with or without ``PM_SCAN_CHECK_WPASYNC``.
- - Both of those operations can be combined into one atomic operation where we can
- get and write protect the pages as well.
- Following flags about pages are currently supported:
- - ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
- - ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected
- - ``PAGE_IS_FILE`` - Page is file backed
- - ``PAGE_IS_PRESENT`` - Page is present in the memory
- - ``PAGE_IS_SWAPPED`` - Page is in swapped
- - ``PAGE_IS_PFNZERO`` - Page has zero PFN
- - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
- The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
- 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
- field. This field will be helpful in recognizing the structure if extensions
- are done later.
- 2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING``
- and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get
- operation is optionally performed depending upon if the output buffer is
- provided or not.
- 3. The range is specified through ``start`` and ``end``.
- 4. The walk can abort before visiting the complete range such as the user buffer
- can get full etc. The walk ending address is specified in``end_walk``.
- 5. The output buffer of ``struct page_region`` array and size is specified in
- ``vec`` and ``vec_len``.
- 6. The optional maximum requested pages are specified in the ``max_pages``.
- 7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
- ``category_inverted`` and ``return_mask``.
- Find pages which have been written and WP them as well::
- struct pm_scan_arg arg = {
- .size = sizeof(arg),
- .flags = PM_SCAN_CHECK_WPASYNC | PM_SCAN_CHECK_WPASYNC,
- ..
- .category_mask = PAGE_IS_WRITTEN,
- .return_mask = PAGE_IS_WRITTEN,
- };
- Find pages which have been written, are file backed, not swapped and either
- present or huge::
- struct pm_scan_arg arg = {
- .size = sizeof(arg),
- .flags = 0,
- ..
- .category_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED,
- .category_inverted = PAGE_IS_SWAPPED,
- .category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
- .return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED |
- PAGE_IS_PRESENT | PAGE_IS_HUGE,
- };
- The ``PAGE_IS_WRITTEN`` flag can be considered as a better-performing alternative
- of soft-dirty flag. It doesn't get affected by VMA merging of the kernel and hence
- the user can find the true soft-dirty pages in case of normal pages. (There may
- still be extra dirty pages reported for THP or Hugetlb pages.)
- "PAGE_IS_WRITTEN" category is used with uffd write protect-enabled ranges to
- implement memory dirty tracking in userspace:
- 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
- 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
- are set by ``UFFDIO_API`` IOCTL.
- 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
- through ``UFFDIO_REGISTER`` IOCTL.
- 4. Then any part of the registered memory or the whole memory region must
- be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_WP_MATCHING``
- or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the
- same operation. The former is better in terms of performance.
- 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which
- have been written to since they were last marked and/or optionally write protect
- the pages as well.
|