10 years ago · 25edd8bffd
--- a/Documentation/vm/userfaultfd.txt
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,142 @@
 
				+= Userfaultfd =
			
 
				+
			
 
				+== Objective ==
			
 
				+
			
 
				+Userfaults allow the implementation of on-demand paging from userland
			
 
				+and more generally they allow userland to take control of various
			
 
				+memory page faults, something otherwise only the kernel code could do.
			
 
				+
			
 
				+For example, userfaults allow a proper and more optimal implementation
			
 
				+of the PROT_NONE+SIGSEGV trick.
			
 
				+
			
 
				+== Design ==
			
 
				+
			
 
				+Userfaults are delivered and resolved through the userfaultfd syscall.
			
 
				+
			
 
				+The userfaultfd (aside from registering and unregistering virtual
			
 
				+memory ranges) provides two primary functionalities:
			
 
				+
			
 
				+1) read/POLLIN protocol to notify a userland thread of the faults
			
 
				+   happening
			
 
				+
			
 
				+2) various UFFDIO_* ioctls that can manage the virtual memory regions
			
 
				+   registered in the userfaultfd, allowing userland to efficiently
			
 
				+   resolve the userfaults it receives via 1) or to manage the virtual
			
 
				+   memory in the background
			
 
				+
			
 
				+The real advantage of userfaults compared to regular virtual memory
			
 
				+management of mremap/mprotect is that the userfaults in all their
			
 
				+operations never involve heavyweight structures like vmas (in fact the
			
 
				+userfaultfd runtime load never takes the mmap_sem for writing).
			
 
				+
			
 
				+Vmas are not suitable for page- (or hugepage) granular fault tracking
			
 
				+when dealing with virtual address spaces that could span
			
 
				+Terabytes. Too many vmas would be needed for that.
			
 
				+
			
 
				+The userfaultfd, once opened by invoking the syscall, can also be
			
 
				+passed using unix domain sockets to a manager process, so the same
			
 
				+manager process could handle the userfaults of a multitude of
			
 
				+different processes without them being aware of what is going on
			
 
				+(well of course unless they later try to use the userfaultfd
			
 
				+themselves on the same region the manager is already tracking, which
			
 
				+is a corner case that would currently return -EBUSY).
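Passing the userfaultfd to a manager process can be done with the usual
SCM_RIGHTS ancillary message on an AF_UNIX socket. The following is a minimal
sketch, not taken from the patch: send_fd() and recv_fd() are hypothetical
helper names, and the socket is assumed to be already connected. Nothing in it
is specific to userfaultfd, any file descriptor is transferred the same way.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error (errno set by sendmsg). */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* one byte of ordinary data carries the ancillary message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new descriptor in this process, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The monitored process would call send_fd() with its userfaultfd right after
enabling it, and the manager would call recv_fd() and then poll the received
descriptor.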
			
 
				+
			
 
				+== API ==
			
 
				+
			
 
				+When first opened the userfaultfd must be enabled by invoking the
			
 
				+UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
			
 
				+a later API version) which will specify the read/POLLIN protocol
			
 
				+userland intends to speak on the UFFD. The UFFDIO_API ioctl, if
			
 
				+successful (i.e. if the requested uffdio_api.api is spoken also by the
			
 
				+running kernel), will return into uffdio_api.features and
			
 
				+uffdio_api.ioctls two 64-bit bitmasks of, respectively, the activated
			
 
				+features of the read(2) protocol and the generic ioctls available.
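The handshake described above can be sketched as follows. This is an
illustration under assumptions, not code from the patch: uffd_open() is a
hypothetical helper name, the raw syscall is invoked through syscall(2)
because a libc wrapper may not exist, and the O_CLOEXEC | O_NONBLOCK flags are
just one reasonable choice.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a userfaultfd and enable it with UFFDIO_API.
 * On success returns the fd and stores the two bitmasks the kernel
 * reported. On failure returns -1 with errno set. */
int uffd_open(uint64_t *features, uint64_t *ioctls)
{
	struct uffdio_api api;
	int saved_errno;
	int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0)
		return -1;	/* e.g. ENOSYS or EPERM, depending on the system */

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;	/* the protocol version userland intends to speak */
	api.features = 0;	/* request no optional features */

	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	*features = api.features;	/* activated read(2) protocol features */
	*ioctls = api.ioctls;		/* generic ioctls available on this fd */
	return fd;
}
```

A caller would typically check that the returned ioctls bitmask has the
(1 << _UFFDIO_REGISTER) bit set before going on to register a range.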
			
 
				+
			
 
				+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
			
 
				+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
			
 
				+register a memory range in the userfaultfd by setting the
			
 
				+uffdio_register structure accordingly. The uffdio_register.mode
			
 
				+bitmask will specify to the kernel which kind of faults to track for
			
 
				+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
			
 
				+pages). The UFFDIO_REGISTER ioctl will return the
			
 
				+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
			
 
				+userfaults on the range registered. Not all ioctls will necessarily be
			
 
				+supported for all memory types depending on the underlying virtual
			
 
				+memory backend (anonymous memory vs tmpfs vs real filebacked
			
 
				+mappings).
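Registering a range for missing-page tracking might look like the sketch
below. The helper names uffd_register_missing() and uffd_can_copy() are
hypothetical, and the sketch assumes the caller passes a page-aligned address
and length.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: register [addr, addr + len) so that faults on
 * missing pages are reported on uffd. Returns 0 and stores the bitmask
 * of ioctls usable on this range, or -1 with errno set. */
int uffd_register_missing(int uffd, void *addr, uint64_t len,
			  uint64_t *range_ioctls)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uintptr_t)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;

	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	*range_ioctls = reg.ioctls;
	return 0;
}

/* Hypothetical helper: does the registered range support UFFDIO_COPY? */
int uffd_can_copy(uint64_t range_ioctls)
{
	return (int)((range_ioctls >> _UFFDIO_COPY) & 1);
}
```

Checking the returned bitmask before using an ioctl matters because, as noted
above, not every memory backend supports every ioctl.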
			
 
				+
			
 
				+Userland can use the uffdio_register.ioctls to manage the virtual
			
 
				+address space in the background (to add or potentially also remove
			
 
				+memory from the userfaultfd registered range). This means a userfault
			
 
				+could trigger just before userland maps the user-faulted page in the
			
 
				+background.
			
 
				+
			
 
				+The primary ioctl to resolve userfaults is UFFDIO_COPY. That
			
 
				+atomically copies a page into the userfault registered range and wakes
			
 
				+up the blocked userfaults (unless uffdio_copy.mode &
			
 
				+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
			
 
				+UFFDIO_COPY. They're atomic in the sense that nothing can see a
			
 
				+half-copied page since it'll keep userfaulting until the copy has
			
 
				+finished.
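A minimal sketch of resolving one fault with UFFDIO_COPY follows.
uffd_copy_page() is a hypothetical helper name. The sketch assumes page_size
is a power of two and that src points to a full page of data in the caller's
own address space.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: atomically install one page at the page that
 * contains dst. If wake is zero the blocked faulting threads are left
 * asleep (UFFDIO_COPY_MODE_DONTWAKE). Returns 0 on success, -1 with
 * errno set on failure. */
int uffd_copy_page(int uffd, uint64_t dst, const void *src,
		   uint64_t page_size, int wake)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = dst & ~(page_size - 1);	/* round down to a page boundary */
	copy.src = (uintptr_t)src;
	copy.len = page_size;
	copy.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;

	return ioctl(uffd, UFFDIO_COPY, &copy);
}
```

The fault address reported by read(2) can be passed directly as dst, since the
helper rounds it down to the start of the page.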
			
 
				+
			
 
				+== QEMU/KVM ==
			
 
				+
			
 
				+QEMU/KVM is using the userfaultfd syscall to implement postcopy live
			
 
				+migration. Postcopy live migration is one form of memory
			
 
				+externalization consisting of a virtual machine running with part or
			
 
				+all of its memory residing on a different node in the cloud. The
			
 
				+userfaultfd abstraction is generic enough that not a single line of
			
 
				+KVM kernel code had to be modified in order to add postcopy live
			
 
				+migration to QEMU.
			
 
				+
			
 
				+Guest async page faults, FOLL_NOWAIT and all other GUP features work
			
 
				+just fine in combination with userfaults. Userfaults trigger async
			
 
				+page faults in the guest scheduler so those guest processes that
			
 
				+aren't waiting for userfaults (e.g. network bound) can keep running in
			
 
				+the guest vcpus.
			
 
				+
			
 
				+It is generally beneficial to run one pass of precopy live migration
			
 
				+just before starting postcopy live migration, in order to avoid
			
 
				+generating userfaults for readonly guest regions.
			
 
				+
			
 
				+The implementation of postcopy live migration currently uses one
			
 
				+single bidirectional socket but in the future two different sockets
			
 
				+will be used (to reduce the latency of the userfaults to the minimum
			
 
				+possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
			
 
				+
			
 
				+The QEMU in the source node writes all pages that it knows are missing
			
 
				+in the destination node, into the socket, and the migration thread of
			
 
				+the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
			
 
				+ioctls on the userfaultfd in order to map the received pages into the
			
 
				+guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
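The choice between the two ioctls on the destination could be sketched as
below. This is not QEMU's code: page_is_zero() and uffd_place_page() are
hypothetical names, and scanning the received buffer on the destination is an
assumption of this sketch (the source could just as well flag zero pages in
the migration stream).

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: returns 1 if every byte of the page is zero. */
int page_is_zero(const unsigned char *page, uint64_t page_size)
{
	uint64_t i;

	for (i = 0; i < page_size; i++)
		if (page[i])
			return 0;
	return 1;
}

/* Hypothetical helper: map one received page at the page-aligned guest
 * address dst, using UFFDIO_ZEROPAGE for zero pages and UFFDIO_COPY
 * otherwise. Returns 0 on success, -1 with errno set on failure. */
int uffd_place_page(int uffd, uint64_t dst, const void *src,
		    uint64_t page_size)
{
	if (page_is_zero(src, page_size)) {
		struct uffdio_zeropage zp;

		memset(&zp, 0, sizeof(zp));
		zp.range.start = dst;
		zp.range.len = page_size;
		return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
	} else {
		struct uffdio_copy copy;

		memset(&copy, 0, sizeof(copy));
		copy.dst = dst;
		copy.src = (uintptr_t)src;
		copy.len = page_size;
		return ioctl(uffd, UFFDIO_COPY, &copy);
	}
}
```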
			
 
				+
			
 
				+A different postcopy thread in the destination node listens with
			
 
				+poll() to the userfaultfd in parallel. When a POLLIN event is
			
 
				+generated after a userfault triggers, the postcopy thread calls read() on
			
 
				+the userfaultfd and receives the fault address (or -EAGAIN in case the
			
 
				+userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
			
 
				+by the parallel QEMU migration thread).
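The poll()/read() step of the postcopy thread can be sketched as follows.
uffd_next_fault() is a hypothetical helper name, and the sketch assumes the
userfaultfd was opened with O_NONBLOCK, so that read() fails with EAGAIN
instead of blocking when the fault has already been resolved.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for one page-fault event.
 * Returns 1 and stores the fault address if a fault was read,
 *         0 if there is nothing to handle (timeout, EINTR, or EAGAIN
 *           because the fault was already resolved),
 *        -1 on any other error or unexpected event. */
int uffd_next_fault(int uffd, int timeout_ms, uint64_t *addr)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;
	int ready;

	ready = poll(&pfd, 1, timeout_ms);
	if (ready < 0)
		return errno == EINTR ? 0 : -1;
	if (ready == 0)
		return 0;			/* timed out */
	if (!(pfd.revents & POLLIN))
		return -1;			/* POLLERR or POLLHUP */

	n = read(uffd, &msg, sizeof(msg));
	if (n < 0)
		return errno == EAGAIN ? 0 : -1;
	if (n != (ssize_t)sizeof(msg) || msg.event != UFFD_EVENT_PAGEFAULT)
		return -1;

	*addr = msg.arg.pagefault.address;
	return 1;
}
```

On a return value of 1, the postcopy thread would forward the address to the
source node over the socket.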
			
 
				+
			
 
				+After the QEMU postcopy thread (running in the destination node) gets
			
 
				+the userfault address it writes the information about the missing page
			
 
				+into the socket. The QEMU source node receives the information and
			
 
				+roughly "seeks" to that page address and continues sending all
			
 
				+remaining missing pages from that new page offset. Soon after that
			
 
				+(just the time to flush the tcp_wmem queue through the network) the
			
 
				+migration thread in the QEMU running in the destination node will
			
 
				+receive the page that triggered the userfault and it'll map it as
			
 
				+usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
			
 
				+was spontaneously sent by the source or if it was an urgent page
			
 
				+requested through a userfault).
			
 
				+
			
 
				+By the time the userfaults start, the QEMU in the destination node
			
 
				+doesn't need to keep any per-page state bitmap relative to the live
			
 
				+migration around, and a single per-page bitmap has to be maintained in
			
 
				+the QEMU running in the source node to know which pages are still
			
 
				+missing in the destination node. The bitmap in the source node is
			
 
				+checked to find which missing pages to send in round robin and we seek
			
 
				+over it when receiving incoming userfaults. After sending each page,
			
 
				+the bitmap is of course updated accordingly. It's also useful to avoid
			
 
				+sending the same page twice (in case the userfault is read by the
			
 
				+postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
			
 
				+thread).
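The source-side bookkeeping described above (round-robin over a bitmap of
missing pages, with a seek on each incoming userfault) can be illustrated with
a small sketch. It is not QEMU's implementation: struct missing_map,
missing_seek() and missing_next() are hypothetical names, and the linear scan
is chosen for clarity rather than speed.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-migration state kept by the source node. */
struct missing_map {
	unsigned long *bits;	/* bit i set: page i still missing on destination */
	size_t npages;		/* number of pages covered by the bitmap */
	size_t cursor;		/* next page index to consider sending */
};

/* An incoming userfault: continue sending from the faulting page. */
void missing_seek(struct missing_map *m, size_t page)
{
	if (page < m->npages)
		m->cursor = page;
}

/* Pick the next missing page at or after the cursor, wrapping around
 * once. The page's bit is cleared so it is never sent twice. Returns
 * the page index, or -1 when no page is missing any more. */
long missing_next(struct missing_map *m)
{
	size_t i;

	for (i = 0; i < m->npages; i++) {
		size_t page = (m->cursor + i) % m->npages;
		unsigned long mask = 1UL << (page % BITS_PER_WORD);
		unsigned long *word = &m->bits[page / BITS_PER_WORD];

		if (*word & mask) {
			*word &= ~mask;		/* mark as sent */
			m->cursor = (page + 1) % m->npages;
			return (long)page;
		}
	}
	return -1;
}
```

Because the bit is cleared when a page is picked, a userfault that arrives for
a page already sent only moves the cursor, and the page itself is not sent a
second time.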