7 years ago · 45c9a74f64
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4313,7 +4313,8 @@
 
				 			Format: [always|madvise|never]
			
 
				 			Can be used to control the default behavior of the system
			
 
				 			with respect to transparent hugepages.
			
 
				-			See Documentation/vm/transhuge.rst for more details.
			
 
				+			See Documentation/admin-guide/mm/transhuge.rst
			
 
				+			for more details.
			
 
				 
			
 
				 	tsc=		Disable clocksource stability checks for TSC.
			
 
				 			Format: <string>
			
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -27,4 +27,5 @@ the Linux memory management.
 
				    numa_memory_policy
			
 
				    pagemap
			
 
				    soft-dirty
			
 
				+   transhuge
			
 
				    userfaultfd
			
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -0,0 +1,418 @@
 
				+.. _admin_guide_transhuge:
			
 
				+
			
 
				+============================
			
 
				+Transparent Hugepage Support
			
 
				+============================
			
 
				+
			
 
				+Objective
			
 
				+=========
			
 
				+
			
 
				+Performance critical computing applications dealing with large memory
			
 
				+working sets are already running on top of libhugetlbfs and in turn
			
 
				+hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of
			
 
				+using huge pages for the backing of virtual memory with huge pages
			
 
				+that supports the automatic promotion and demotion of page sizes and
			
 
				+without the shortcomings of hugetlbfs.
			
 
				+
			
 
				+Currently THP only works for anonymous memory mappings and tmpfs/shmem.
			
 
				+But in the future it can expand to other filesystems.
			
 
				+
			
 
				+.. note::
			
 
				+   in the examples below we presume that the basic page size is 4K and
			
 
				+   the huge page size is 2M, although the actual numbers may vary
			
 
				+   depending on the CPU architecture.
			
 
				+
			
 
				+The reason applications are running faster is because of two
			
 
				+factors. The first factor is almost completely irrelevant and it's not
			
 
				+of significant interest because it'll also have the downside of
			
 
				+requiring larger clear-page copy-page in page faults which is a
			
 
				+potentially negative effect. The first factor consists in taking a
			
 
				+single page fault for each 2M virtual region touched by userland (so
			
 
				+reducing the enter/exit kernel frequency by a 512 times factor). This
			
 
				+only matters the first time the memory is accessed for the lifetime of
			
 
				+a memory mapping. The second long lasting and much more important
			
 
				+factor will affect all subsequent accesses to the memory for the whole
			
 
				+runtime of the application. The second factor consist of two
			
 
				+components:
			
 
				+
			
 
				+1) the TLB miss will run faster (especially with virtualization using
			
 
				+   nested pagetables but almost always also on bare metal without
			
 
				+   virtualization)
			
 
				+
			
 
				+2) a single TLB entry will be mapping a much larger amount of virtual
			
 
				+   memory in turn reducing the number of TLB misses. With
			
 
				+   virtualization and nested pagetables the TLB can be mapped of
			
 
				+   larger size only if both KVM and the Linux guest are using
			
 
				+   hugepages but a significant speedup already happens if only one of
			
 
				+   the two is using hugepages just because of the fact the TLB miss is
			
 
				+   going to run faster.
			
 
				+
			
 
				+THP can be enabled system wide or restricted to certain tasks or even
			
 
				+memory ranges inside task's address space. Unless THP is completely
			
 
				+disabled, there is ``khugepaged`` daemon that scans memory and
			
 
				+collapses sequences of basic pages into huge pages.
			
 
				+
			
 
				+The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
			
 
				+interface and using madivse(2) and prctl(2) system calls.
			
 
				+
			
 
				+Transparent Hugepage Support maximizes the usefulness of free memory
			
 
				+if compared to the reservation approach of hugetlbfs by allowing all
			
 
				+unused memory to be used as cache or other movable (or even unmovable
			
 
				+entities). It doesn't require reservation to prevent hugepage
			
 
				+allocation failures to be noticeable from userland. It allows paging
			
 
				+and all other advanced VM features to be available on the
			
 
				+hugepages. It requires no modifications for applications to take
			
 
				+advantage of it.
			
 
				+
			
 
				+Applications however can be further optimized to take advantage of
			
 
				+this feature, like for example they've been optimized before to avoid
			
 
				+a flood of mmap system calls for every malloc(4k). Optimizing userland
			
 
				+is by far not mandatory and khugepaged already can take care of long
			
 
				+lived page allocations even for hugepage unaware applications that
			
 
				+deals with large amounts of memory.
			
 
				+
			
 
				+In certain cases when hugepages are enabled system wide, application
			
 
				+may end up allocating more memory resources. An application may mmap a
			
 
				+large region but only touch 1 byte of it, in that case a 2M page might
			
 
				+be allocated instead of a 4k page for no good. This is why it's
			
 
				+possible to disable hugepages system-wide and to only have them inside
			
 
				+MADV_HUGEPAGE madvise regions.
			
 
				+
			
 
				+Embedded systems should enable hugepages only inside madvise regions
			
 
				+to eliminate any risk of wasting any precious byte of memory and to
			
 
				+only run faster.
			
 
				+
			
 
				+Applications that gets a lot of benefit from hugepages and that don't
			
 
				+risk to lose memory by using hugepages, should use
			
 
				+madvise(MADV_HUGEPAGE) on their critical mmapped regions.
			
 
				+
			
 
				+.. _thp_sysfs:
			
 
				+
			
 
				+sysfs
			
 
				+=====
			
 
				+
			
 
				+Global THP controls
			
 
				+-------------------
			
 
				+
			
 
				+Transparent Hugepage Support for anonymous memory can be entirely disabled
			
 
				+(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
			
 
				+regions (to avoid the risk of consuming more memory resources) or enabled
			
 
				+system wide. This can be achieved with one of::
			
 
				+
			
 
				+	echo always >/sys/kernel/mm/transparent_hugepage/enabled
			
 
				+	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
			
 
				+	echo never >/sys/kernel/mm/transparent_hugepage/enabled
			
 
				+
			
 
				+It's also possible to limit defrag efforts in the VM to generate
			
 
				+anonymous hugepages in case they're not immediately free to madvise
			
 
				+regions or to never try to defrag memory and simply fallback to regular
			
 
				+pages unless hugepages are immediately available. Clearly if we spend CPU
			
 
				+time to defrag memory, we would expect to gain even more by the fact we
			
 
				+use hugepages later instead of regular pages. This isn't always
			
 
				+guaranteed, but it may be more likely in case the allocation is for a
			
 
				+MADV_HUGEPAGE region.
			
 
				+
			
 
				+::
			
 
				+
			
 
				+	echo always >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				+	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				+	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				+	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				+	echo never >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				+
			
 
				+always
			
 
				+	means that an application requesting THP will stall on
			
 
				+	allocation failure and directly reclaim pages and compact
			
 
				+	memory in an effort to allocate a THP immediately. This may be
			
 
				+	desirable for virtual machines that benefit heavily from THP
			
 
				+	use and are willing to delay the VM start to utilise them.
			
 
				+
			
 
				+defer
			
 
				+	means that an application will wake kswapd in the background
			
 
				+	to reclaim pages and wake kcompactd to compact memory so that
			
 
				+	THP is available in the near future. It's the responsibility
			
 
				+	of khugepaged to then install the THP pages later.
			
 
				+
			
 
				+defer+madvise
			
 
				+	will enter direct reclaim and compaction like ``always``, but
			
 
				+	only for regions that have used madvise(MADV_HUGEPAGE); all
			
 
				+	other regions will wake kswapd in the background to reclaim
			
 
				+	pages and wake kcompactd to compact memory so that THP is
			
 
				+	available in the near future.
			
 
				+
			
 
				+madvise
			
 
				+	will enter direct reclaim like ``always`` but only for regions
			
 
				+	that are have used madvise(MADV_HUGEPAGE). This is the default
			
 
				+	behaviour.
			
 
				+
			
 
				+never
			
 
				+	should be self-explanatory.
			
 
				+
			
 
				+By default kernel tries to use huge zero page on read page fault to
			
 
				+anonymous mapping. It's possible to disable huge zero page by writing 0
			
 
				+or enable it back by writing 1::
			
 
				+
			
 
				+	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
			
 
				+	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
			
 
				+
			
 
				+Some userspace (such as a test program, or an optimized memory allocation
			
 
				+library) may want to know the size (in bytes) of a transparent hugepage::
			
 
				+
			
 
				+	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
			
 
				+
			
 
				+khugepaged will be automatically started when
			
 
				+transparent_hugepage/enabled is set to "always" or "madvise, and it'll
			
 
				+be automatically shutdown if it's set to "never".
			
 
				+
			
 
				+Khugepaged controls
			
 
				+-------------------
			
 
				+
			
 
				+khugepaged runs usually at low frequency so while one may not want to
			
 
				+invoke defrag algorithms synchronously during the page faults, it
			
 
				+should be worth invoking defrag at least in khugepaged. However it's
			
 
				+also possible to disable defrag in khugepaged by writing 0 or enable
			
 
				+defrag in khugepaged by writing 1::
			
 
				+
			
 
				+	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
			
 
				+	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
			
 
				+
			
 
				+You can also control how many pages khugepaged should scan at each
			
 
				+pass::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
			
 
				+
			
 
				+and how many milliseconds to wait in khugepaged between each pass (you
			
 
				+can set this to 0 to run khugepaged at 100% utilization of one core)::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
			
 
				+
			
 
				+and how many milliseconds to wait in khugepaged if there's an hugepage
			
 
				+allocation failure to throttle the next allocation attempt::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
			
 
				+
			
 
				+The khugepaged progress can be seen in the number of pages collapsed::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
			
 
				+
			
 
				+for each pass::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
			
 
				+
			
 
				+``max_ptes_none`` specifies how many extra small pages (that are
			
 
				+not already mapped) can be allocated when collapsing a group
			
 
				+of small pages into one large page::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
			
 
				+
			
 
				+A higher value leads to use additional memory for programs.
			
 
				+A lower value leads to gain less thp performance. Value of
			
 
				+max_ptes_none can waste cpu time very little, you can
			
 
				+ignore it.
			
 
				+
			
 
				+``max_ptes_swap`` specifies how many pages can be brought in from
			
 
				+swap when collapsing a group of pages into a transparent huge page::
			
 
				+
			
 
				+	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
			
 
				+
			
 
				+A higher value can cause excessive swap IO and waste
			
 
				+memory. A lower value can prevent THPs from being
			
 
				+collapsed, resulting fewer pages being collapsed into
			
 
				+THPs, and lower memory access performance.
			
 
				+
			
 
				+Boot parameter
			
 
				+==============
			
 
				+
			
 
				+You can change the sysfs boot time defaults of Transparent Hugepage
			
 
				+Support by passing the parameter ``transparent_hugepage=always`` or
			
 
				+``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
			
 
				+to the kernel command line.
			
 
				+
			
 
				+Hugepages in tmpfs/shmem
			
 
				+========================
			
 
				+
			
 
				+You can control hugepage allocation policy in tmpfs with mount option
			
 
				+``huge=``. It can have following values:
			
 
				+
			
 
				+always
			
 
				+    Attempt to allocate huge pages every time we need a new page;
			
 
				+
			
 
				+never
			
 
				+    Do not allocate huge pages;
			
 
				+
			
 
				+within_size
			
 
				+    Only allocate huge page if it will be fully within i_size.
			
 
				+    Also respect fadvise()/madvise() hints;
			
 
				+
			
 
				+advise
			
 
				+    Only allocate huge pages if requested with fadvise()/madvise();
			
 
				+
			
 
				+The default policy is ``never``.
			
 
				+
			
 
				+``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
			
 
				+``huge=never`` will not attempt to break up huge pages at all, just stop more
			
 
				+from being allocated.
			
 
				+
			
 
				+There's also sysfs knob to control hugepage allocation policy for internal
			
 
				+shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
			
 
				+is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
			
 
				+MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
			
 
				+
			
 
				+In addition to policies listed above, shmem_enabled allows two further
			
 
				+values:
			
 
				+
			
 
				+deny
			
 
				+    For use in emergencies, to force the huge option off from
			
 
				+    all mounts;
			
 
				+force
			
 
				+    Force the huge option on for all - very useful for testing;
			
 
				+
			
 
				+Need of application restart
			
 
				+===========================
			
 
				+
			
 
				+The transparent_hugepage/enabled values and tmpfs mount option only affect
			
 
				+future behavior. So to make them effective you need to restart any
			
 
				+application that could have been using hugepages. This also applies to the
			
 
				+regions registered in khugepaged.
			
 
				+
			
 
				+Monitoring usage
			
 
				+================
			
 
				+
			
 
				+The number of anonymous transparent huge pages currently used by the
			
 
				+system is available by reading the AnonHugePages field in ``/proc/meminfo``.
			
 
				+To identify what applications are using anonymous transparent huge pages,
			
 
				+it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
			
 
				+for each mapping.
			
 
				+
			
 
				+The number of file transparent huge pages mapped to userspace is available
			
 
				+by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
			
 
				+To identify what applications are mapping file transparent huge pages, it
			
 
				+is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
			
 
				+for each mapping.
			
 
				+
			
 
				+Note that reading the smaps file is expensive and reading it
			
 
				+frequently will incur overhead.
			
 
				+
			
 
				+There are a number of counters in ``/proc/vmstat`` that may be used to
			
 
				+monitor how successfully the system is providing huge pages for use.
			
 
				+
			
 
				+thp_fault_alloc
			
 
				+	is incremented every time a huge page is successfully
			
 
				+	allocated to handle a page fault. This applies to both the
			
 
				+	first time a page is faulted and for COW faults.
			
 
				+
			
 
				+thp_collapse_alloc
			
 
				+	is incremented by khugepaged when it has found
			
 
				+	a range of pages to collapse into one huge page and has
			
 
				+	successfully allocated a new huge page to store the data.
			
 
				+
			
 
				+thp_fault_fallback
			
 
				+	is incremented if a page fault fails to allocate
			
 
				+	a huge page and instead falls back to using small pages.
			
 
				+
			
 
				+thp_collapse_alloc_failed
			
 
				+	is incremented if khugepaged found a range
			
 
				+	of pages that should be collapsed into one huge page but failed
			
 
				+	the allocation.
			
 
				+
			
 
				+thp_file_alloc
			
 
				+	is incremented every time a file huge page is successfully
			
 
				+	allocated.
			
 
				+
			
 
				+thp_file_mapped
			
 
				+	is incremented every time a file huge page is mapped into
			
 
				+	user address space.
			
 
				+
			
 
				+thp_split_page
			
 
				+	is incremented every time a huge page is split into base
			
 
				+	pages. This can happen for a variety of reasons but a common
			
 
				+	reason is that a huge page is old and is being reclaimed.
			
 
				+	This action implies splitting all PMD the page mapped with.
			
 
				+
			
 
				+thp_split_page_failed
			
 
				+	is incremented if kernel fails to split huge
			
 
				+	page. This can happen if the page was pinned by somebody.
			
 
				+
			
 
				+thp_deferred_split_page
			
 
				+	is incremented when a huge page is put onto split
			
 
				+	queue. This happens when a huge page is partially unmapped and
			
 
				+	splitting it would free up some memory. Pages on split queue are
			
 
				+	going to be split under memory pressure.
			
 
				+
			
 
				+thp_split_pmd
			
 
				+	is incremented every time a PMD split into table of PTEs.
			
 
				+	This can happen, for instance, when application calls mprotect() or
			
 
				+	munmap() on part of huge page. It doesn't split huge page, only
			
 
				+	page table entry.
			
 
				+
			
 
				+thp_zero_page_alloc
			
 
				+	is incremented every time a huge zero page is
			
 
				+	successfully allocated. It includes allocations which where
			
 
				+	dropped due race with other allocation. Note, it doesn't count
			
 
				+	every map of the huge zero page, only its allocation.
			
 
				+
			
 
				+thp_zero_page_alloc_failed
			
 
				+	is incremented if kernel fails to allocate
			
 
				+	huge zero page and falls back to using small pages.
			
 
				+
			
 
				+thp_swpout
			
 
				+	is incremented every time a huge page is swapout in one
			
 
				+	piece without splitting.
			
 
				+
			
 
				+thp_swpout_fallback
			
 
				+	is incremented if a huge page has to be split before swapout.
			
 
				+	Usually because failed to allocate some continuous swap space
			
 
				+	for the huge page.
			
 
				+
			
 
				+As the system ages, allocating huge pages may be expensive as the
			
 
				+system uses memory compaction to copy data around memory to free a
			
 
				+huge page for use. There are some counters in ``/proc/vmstat`` to help
			
 
				+monitor this overhead.
			
 
				+
			
 
				+compact_stall
			
 
				+	is incremented every time a process stalls to run
			
 
				+	memory compaction so that a huge page is free for use.
			
 
				+
			
 
				+compact_success
			
 
				+	is incremented if the system compacted memory and
			
 
				+	freed a huge page for use.
			
 
				+
			
 
				+compact_fail
			
 
				+	is incremented if the system tries to compact memory
			
 
				+	but failed.
			
 
				+
			
 
				+compact_pages_moved
			
 
				+	is incremented each time a page is moved. If
			
 
				+	this value is increasing rapidly, it implies that the system
			
 
				+	is copying a lot of data to satisfy the huge page allocation.
			
 
				+	It is possible that the cost of copying exceeds any savings
			
 
				+	from reduced TLB misses.
			
 
				+
			
 
				+compact_pagemigrate_failed
			
 
				+	is incremented when the underlying mechanism
			
 
				+	for moving a page failed.
			
 
				+
			
 
				+compact_blocks_moved
			
 
				+	is incremented each time memory compaction examines
			
 
				+	a huge page aligned range of pages.
			
 
				+
			
 
				+It is possible to establish how long the stalls were using the function
			
 
				+tracer to record how long was spent in __alloc_pages_nodemask and
			
 
				+using the mm_page_alloc tracepoint to identify which allocations were
			
 
				+for huge pages.
			
 
				+
			
 
				+Optimizing the applications
			
 
				+===========================
			
 
				+
			
 
				+To be guaranteed that the kernel will map a 2M page immediately in any
			
 
				+memory region, the mmap region has to be hugepage naturally
			
 
				+aligned. posix_memalign() can provide that guarantee.
			
 
				+
			
 
				+Hugetlbfs
			
 
				+=========
			
 
				+
			
 
				+You can use hugetlbfs on a kernel that has transparent hugepage
			
 
				+support enabled just fine as always. No difference can be noted in
			
 
				+hugetlbfs other than there will be less overall fragmentation. All
			
 
				+usual features belonging to hugetlbfs are preserved and
			
 
				+unaffected. libhugetlbfs will also work fine as usual.
			
--- a/Documentation/vm/transhuge.rst
+++ b/Documentation/vm/transhuge.rst
@@ -4,418 +4,8 @@
 
				 Transparent Hugepage Support
			
 
				 ============================
			
 
				 
			
 
				-Objective
			
 
				-=========
			
 
				-
			
 
				-Performance critical computing applications dealing with large memory
			
 
				-working sets are already running on top of libhugetlbfs and in turn
			
 
				-hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of
			
 
				-using huge pages for the backing of virtual memory with huge pages
			
 
				-that supports the automatic promotion and demotion of page sizes and
			
 
				-without the shortcomings of hugetlbfs.
			
 
				-
			
 
				-Currently THP only works for anonymous memory mappings and tmpfs/shmem.
			
 
				-But in the future it can expand to other filesystems.
			
 
				-
			
 
				-.. note::
			
 
				-   in the examples below we presume that the basic page size is 4K and
			
 
				-   the huge page size is 2M, although the actual numbers may vary
			
 
				-   depending on the CPU architecture.
			
 
				-
			
 
				-The reason applications are running faster is because of two
			
 
				-factors. The first factor is almost completely irrelevant and it's not
			
 
				-of significant interest because it'll also have the downside of
			
 
				-requiring larger clear-page copy-page in page faults which is a
			
 
				-potentially negative effect. The first factor consists in taking a
			
 
				-single page fault for each 2M virtual region touched by userland (so
			
 
				-reducing the enter/exit kernel frequency by a 512 times factor). This
			
 
				-only matters the first time the memory is accessed for the lifetime of
			
 
				-a memory mapping. The second long lasting and much more important
			
 
				-factor will affect all subsequent accesses to the memory for the whole
			
 
				-runtime of the application. The second factor consist of two
			
 
				-components:
			
 
				-
			
 
				-1) the TLB miss will run faster (especially with virtualization using
			
 
				-   nested pagetables but almost always also on bare metal without
			
 
				-   virtualization)
			
 
				-
			
 
				-2) a single TLB entry will be mapping a much larger amount of virtual
			
 
				-   memory in turn reducing the number of TLB misses. With
			
 
				-   virtualization and nested pagetables the TLB can be mapped of
			
 
				-   larger size only if both KVM and the Linux guest are using
			
 
				-   hugepages but a significant speedup already happens if only one of
			
 
				-   the two is using hugepages just because of the fact the TLB miss is
			
 
				-   going to run faster.
			
 
				-
			
 
				-THP can be enabled system wide or restricted to certain tasks or even
			
 
				-memory ranges inside task's address space. Unless THP is completely
			
 
				-disabled, there is ``khugepaged`` daemon that scans memory and
			
 
				-collapses sequences of basic pages into huge pages.
			
 
				-
			
 
				-The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
			
 
				-interface and using madivse(2) and prctl(2) system calls.
			
 
				-
			
 
				-Transparent Hugepage Support maximizes the usefulness of free memory
			
 
				-if compared to the reservation approach of hugetlbfs by allowing all
			
 
				-unused memory to be used as cache or other movable (or even unmovable
			
 
				-entities). It doesn't require reservation to prevent hugepage
			
 
				-allocation failures to be noticeable from userland. It allows paging
			
 
				-and all other advanced VM features to be available on the
			
 
				-hugepages. It requires no modifications for applications to take
			
 
				-advantage of it.
			
 
				-
			
 
				-Applications however can be further optimized to take advantage of
			
 
				-this feature, like for example they've been optimized before to avoid
			
 
				-a flood of mmap system calls for every malloc(4k). Optimizing userland
			
 
				-is by far not mandatory and khugepaged already can take care of long
			
 
				-lived page allocations even for hugepage unaware applications that
			
 
				-deals with large amounts of memory.
			
 
				-
			
 
				-In certain cases when hugepages are enabled system wide, application
			
 
				-may end up allocating more memory resources. An application may mmap a
			
 
				-large region but only touch 1 byte of it, in that case a 2M page might
			
 
				-be allocated instead of a 4k page for no good. This is why it's
			
 
				-possible to disable hugepages system-wide and to only have them inside
			
 
				-MADV_HUGEPAGE madvise regions.
			
 
				-
			
 
				-Embedded systems should enable hugepages only inside madvise regions
			
 
				-to eliminate any risk of wasting any precious byte of memory and to
			
 
				-only run faster.
			
 
				-
			
 
				-Applications that gets a lot of benefit from hugepages and that don't
			
 
				-risk to lose memory by using hugepages, should use
			
 
				-madvise(MADV_HUGEPAGE) on their critical mmapped regions.
			
 
				-
			
 
				-.. _thp_sysfs:
			
 
				-
			
 
				-sysfs
			
 
				-=====
			
 
				-
			
 
				-Global THP controls
			
 
				--------------------
			
 
				-
			
 
				-Transparent Hugepage Support for anonymous memory can be entirely disabled
			
 
				-(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
			
 
				-regions (to avoid the risk of consuming more memory resources) or enabled
			
 
				-system wide. This can be achieved with one of::
			
 
				-
			
 
				-	echo always >/sys/kernel/mm/transparent_hugepage/enabled
			
 
				-	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
			
 
				-	echo never >/sys/kernel/mm/transparent_hugepage/enabled
			
 
				-
			
 
				-It's also possible to limit defrag efforts in the VM to generate
			
 
				-anonymous hugepages in case they're not immediately free to madvise
			
 
				-regions or to never try to defrag memory and simply fallback to regular
			
 
				-pages unless hugepages are immediately available. Clearly if we spend CPU
			
 
				-time to defrag memory, we would expect to gain even more by the fact we
			
 
				-use hugepages later instead of regular pages. This isn't always
			
 
				-guaranteed, but it may be more likely in case the allocation is for a
			
 
				-MADV_HUGEPAGE region.
			
 
				-
			
 
				-::
			
 
				-
			
 
				-	echo always >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				-	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				-	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				-	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				-	echo never >/sys/kernel/mm/transparent_hugepage/defrag
			
 
				-
			
 
				-always
			
 
				-	means that an application requesting THP will stall on
			
 
				-	allocation failure and directly reclaim pages and compact
			
 
				-	memory in an effort to allocate a THP immediately. This may be
			
 
				-	desirable for virtual machines that benefit heavily from THP
			
 
				-	use and are willing to delay the VM start to utilise them.
			
 
				-
			
 
				-defer
			
 
				-	means that an application will wake kswapd in the background
			
 
				-	to reclaim pages and wake kcompactd to compact memory so that
			
 
				-	THP is available in the near future. It's the responsibility
			
 
				-	of khugepaged to then install the THP pages later.
			
 
				-
			
 
				-defer+madvise
			
 
				-	will enter direct reclaim and compaction like ``always``, but
			
 
				-	only for regions that have used madvise(MADV_HUGEPAGE); all
			
 
				-	other regions will wake kswapd in the background to reclaim
			
 
				-	pages and wake kcompactd to compact memory so that THP is
			
 
				-	available in the near future.
			
 
				-
			
 
				-madvise
			
 
				-	will enter direct reclaim like ``always`` but only for regions
			
 
				-	that are have used madvise(MADV_HUGEPAGE). This is the default
			
 
				-	behaviour.
			
 
				-
			
 
				-never
			
 
				-	should be self-explanatory.
			
 
				-
			
 
				-By default kernel tries to use huge zero page on read page fault to
			
 
				-anonymous mapping. It's possible to disable huge zero page by writing 0
			
 
				-or enable it back by writing 1::
			
 
				-
			
 
				-	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
			
 
				-	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
			
 
				-
			
 
				-Some userspace (such as a test program, or an optimized memory allocation
			
 
				-library) may want to know the size (in bytes) of a transparent hugepage::
			
 
				-
			
 
				-	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
			
 
				-
			
 
				-khugepaged will be automatically started when
			
 
				-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
			
 
				-be automatically shutdown if it's set to "never".
			
 
				-
			
 
				-Khugepaged controls
			
 
				--------------------
			
 
				-
			
 
				-khugepaged runs usually at low frequency so while one may not want to
			
 
				-invoke defrag algorithms synchronously during the page faults, it
			
 
				-should be worth invoking defrag at least in khugepaged. However it's
			
 
				-also possible to disable defrag in khugepaged by writing 0 or enable
			
 
				-defrag in khugepaged by writing 1::
			
 
				-
			
 
				-	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
			
 
				-	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
			
 
				-
			
 
				-You can also control how many pages khugepaged should scan at each
			
 
				-pass::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
			
 
				-
			
 
				-and how many milliseconds to wait in khugepaged between each pass (you
			
 
				-can set this to 0 to run khugepaged at 100% utilization of one core)::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
			
 
				-
			
 
				-and how many milliseconds to wait in khugepaged if there's an hugepage
			
 
				-allocation failure to throttle the next allocation attempt::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
			
 
				-
			
 
				-The khugepaged progress can be seen in the number of pages collapsed::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
			
 
				-
			
 
				-for each pass::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
			
 
				-
			
 
				-``max_ptes_none`` specifies how many extra small pages (that are
			
 
				-not already mapped) can be allocated when collapsing a group
			
 
				-of small pages into one large page::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
			
 
				-
			
 
				-A higher value leads to use additional memory for programs.
			
 
				-A lower value leads to gain less thp performance. Value of
			
 
				-max_ptes_none can waste cpu time very little, you can
			
 
				-ignore it.
			
 
				-
			
 
				-``max_ptes_swap`` specifies how many pages can be brought in from
			
 
				-swap when collapsing a group of pages into a transparent huge page::
			
 
				-
			
 
				-	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
			
 
				-
			
 
				-A higher value can cause excessive swap IO and waste
			
 
				-memory. A lower value can prevent THPs from being
			
 
				-collapsed, resulting fewer pages being collapsed into
			
 
				-THPs, and lower memory access performance.
			
 
				-
			
 
				-Boot parameter
			
 
				-==============
			
 
				-
			
 
				-You can change the sysfs boot time defaults of Transparent Hugepage
			
 
				-Support by passing the parameter ``transparent_hugepage=always`` or
			
 
				-``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
			
 
				-to the kernel command line.
			
 
				-
			
 
				-Hugepages in tmpfs/shmem
			
 
				-========================
			
 
				-
			
 
				-You can control hugepage allocation policy in tmpfs with mount option
			
 
				-``huge=``. It can have following values:
			
 
				-
			
 
				-always
			
 
				-    Attempt to allocate huge pages every time we need a new page;
			
 
				-
			
 
				-never
			
 
				-    Do not allocate huge pages;
			
 
				-
			
 
				-within_size
			
 
				-    Only allocate huge page if it will be fully within i_size.
			
 
				-    Also respect fadvise()/madvise() hints;
			
 
				-
			
 
				-advise
			
 
				-    Only allocate huge pages if requested with fadvise()/madvise();
			
 
				-
			
 
				-The default policy is ``never``.
			
 
				-
			
 
				-``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
			
 
				-``huge=never`` will not attempt to break up huge pages at all, just stop more
			
 
				-from being allocated.
			
 
				-
			
 
				-There's also sysfs knob to control hugepage allocation policy for internal
			
 
				-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
			
 
				-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
			
 
				-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
			
 
				-
			
 
				-In addition to policies listed above, shmem_enabled allows two further
			
 
				-values:
			
 
				-
			
 
				-deny
			
 
				-    For use in emergencies, to force the huge option off from
			
 
				-    all mounts;
			
 
				-force
			
 
				-    Force the huge option on for all - very useful for testing;
			
 
				-
			
 
				-Need of application restart
			
 
				-===========================
			
 
				-
			
 
				-The transparent_hugepage/enabled values and tmpfs mount option only affect
			
 
				-future behavior. So to make them effective you need to restart any
			
 
				-application that could have been using hugepages. This also applies to the
			
 
				-regions registered in khugepaged.
			
 
				-
			
 
				-Monitoring usage
			
 
				-================
			
 
				-
			
 
				-The number of anonymous transparent huge pages currently used by the
			
 
				-system is available by reading the AnonHugePages field in ``/proc/meminfo``.
			
 
				-To identify what applications are using anonymous transparent huge pages,
			
 
				-it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
			
 
				-for each mapping.
			
 
				-
			
 
				-The number of file transparent huge pages mapped to userspace is available
			
 
				-by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
			
 
				-To identify what applications are mapping file transparent huge pages, it
			
 
				-is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
			
 
				-for each mapping.
			
 
				-
			
 
				-Note that reading the smaps file is expensive and reading it
			
 
				-frequently will incur overhead.
			
 
				-
			
 
				-There are a number of counters in ``/proc/vmstat`` that may be used to
			
 
				-monitor how successfully the system is providing huge pages for use.
			
 
				-
			
 
				-thp_fault_alloc
			
 
				-	is incremented every time a huge page is successfully
			
 
				-	allocated to handle a page fault. This applies to both the
			
 
				-	first time a page is faulted and for COW faults.
			
 
				-
			
 
				-thp_collapse_alloc
			
 
				-	is incremented by khugepaged when it has found
			
 
				-	a range of pages to collapse into one huge page and has
			
 
				-	successfully allocated a new huge page to store the data.
			
 
				-
			
 
				-thp_fault_fallback
			
 
				-	is incremented if a page fault fails to allocate
			
 
				-	a huge page and instead falls back to using small pages.
			
 
				-
			
 
				-thp_collapse_alloc_failed
			
 
				-	is incremented if khugepaged found a range
			
 
				-	of pages that should be collapsed into one huge page but failed
			
 
				-	the allocation.
			
 
				-
			
 
				-thp_file_alloc
			
 
				-	is incremented every time a file huge page is successfully
			
 
				-	allocated.
			
 
				-
			
 
				-thp_file_mapped
			
 
				-	is incremented every time a file huge page is mapped into
			
 
				-	user address space.
			
 
				-
			
 
				-thp_split_page
			
 
				-	is incremented every time a huge page is split into base
			
 
				-	pages. This can happen for a variety of reasons but a common
			
 
				-	reason is that a huge page is old and is being reclaimed.
			
 
				-	This action implies splitting all PMD the page mapped with.
			
 
				-
			
 
				-thp_split_page_failed
			
 
				-	is incremented if kernel fails to split huge
			
 
				-	page. This can happen if the page was pinned by somebody.
			
 
				-
			
 
				-thp_deferred_split_page
			
 
				-	is incremented when a huge page is put onto split
			
 
				-	queue. This happens when a huge page is partially unmapped and
			
 
				-	splitting it would free up some memory. Pages on split queue are
			
 
				-	going to be split under memory pressure.
			
 
				-
			
 
				-thp_split_pmd
			
 
				-	is incremented every time a PMD split into table of PTEs.
			
 
				-	This can happen, for instance, when application calls mprotect() or
			
 
				-	munmap() on part of huge page. It doesn't split huge page, only
			
 
				-	page table entry.
			
 
				-
			
 
				-thp_zero_page_alloc
			
 
				-	is incremented every time a huge zero page is
			
 
				-	successfully allocated. It includes allocations which where
			
 
				-	dropped due race with other allocation. Note, it doesn't count
			
 
				-	every map of the huge zero page, only its allocation.
			
 
				-
			
 
				-thp_zero_page_alloc_failed
			
 
				-	is incremented if kernel fails to allocate
			
 
				-	huge zero page and falls back to using small pages.
			
 
				-
			
 
				-thp_swpout
			
 
				-	is incremented every time a huge page is swapout in one
			
 
				-	piece without splitting.
			
 
				-
			
 
				-thp_swpout_fallback
			
 
				-	is incremented if a huge page has to be split before swapout.
			
 
				-	Usually because failed to allocate some continuous swap space
			
 
				-	for the huge page.
			
 
				-
			
 
				-As the system ages, allocating huge pages may be expensive as the
			
 
				-system uses memory compaction to copy data around memory to free a
			
 
				-huge page for use. There are some counters in ``/proc/vmstat`` to help
			
 
				-monitor this overhead.
			
 
				-
			
 
				-compact_stall
			
 
				-	is incremented every time a process stalls to run
			
 
				-	memory compaction so that a huge page is free for use.
			
 
				-
			
 
				-compact_success
			
 
				-	is incremented if the system compacted memory and
			
 
				-	freed a huge page for use.
			
 
				-
			
 
				-compact_fail
			
 
				-	is incremented if the system tries to compact memory
			
 
				-	but failed.
			
 
				-
			
 
				-compact_pages_moved
			
 
				-	is incremented each time a page is moved. If
			
 
				-	this value is increasing rapidly, it implies that the system
			
 
				-	is copying a lot of data to satisfy the huge page allocation.
			
 
				-	It is possible that the cost of copying exceeds any savings
			
 
				-	from reduced TLB misses.
			
 
				-
			
 
				-compact_pagemigrate_failed
			
 
				-	is incremented when the underlying mechanism
			
 
				-	for moving a page failed.
			
 
				-
			
 
				-compact_blocks_moved
			
 
				-	is incremented each time memory compaction examines
			
 
				-	a huge page aligned range of pages.
			
 
				-
			
 
				-It is possible to establish how long the stalls were using the function
			
 
				-tracer to record how long was spent in __alloc_pages_nodemask and
			
 
				-using the mm_page_alloc tracepoint to identify which allocations were
			
 
				-for huge pages.
			
 
				-
			
 
				-Optimizing the applications
			
 
				-===========================
			
 
				-
			
 
				-To be guaranteed that the kernel will map a 2M page immediately in any
			
 
				-memory region, the mmap region has to be hugepage naturally
			
 
				-aligned. posix_memalign() can provide that guarantee.
			
 
				-
			
 
				-Hugetlbfs
			
 
				-=========
			
 
				-
			
 
				-You can use hugetlbfs on a kernel that has transparent hugepage
			
 
				-support enabled just fine as always. No difference can be noted in
			
 
				-hugetlbfs other than there will be less overall fragmentation. All
			
 
				-usual features belonging to hugetlbfs are preserved and
			
 
				-unaffected. libhugetlbfs will also work fine as usual.
			
 
				+This document describes design principles Transparent Hugepage (THP)
			
 
				+Support and its interaction with other parts of the memory management.
			
 
				 
			
 
				 Design principles
			
 
				 =================