9 years ago · 9f08cfe944
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.txt
@@ -221,18 +221,71 @@ contains the "downcall" which expresses the results of the request.
 
 The slab allocator is used to keep a cache of op structures handy.
 
-The life cycle of a typical op goes like this:
-
-  - obtain and initialize an op structure from the op_cache.
-
-  - queue the op to the pvfs device so that its upcall data can be
-    read by userspace.
-
-  - wait for userspace to write downcall data back to the pvfs device.
-
-  - consume the downcall and return the op struct to the op_cache.
-
-Some ops are atypical with respect to their payloads: readdir and io ops.
+At init time the kernel module defines and initializes a request list
+and an in_progress hash table to keep track of all the ops that are
+in flight at any given time.
+
+Ops are stateful:
+
+ * unknown  - op was just initialized
+ * waiting  - op is on request_list (upward bound)
+ * inprogr  - op is in progress (waiting for downcall)
+ * serviced - op has matching downcall; ok
+ * purged   - op has to start a timer since client-core
+              exited uncleanly before servicing op
+ * given up - submitter has given up waiting for it
+
+When some arbitrary userspace program needs to perform a
+filesystem operation on Orangefs (readdir, I/O, create, whatever)
+an op structure is initialized and tagged with a distinguishing ID
+number. The upcall part of the op is filled out, and the op is
+passed to the "service_operation" function.
+
+service_operation changes the op's state to "waiting", puts
+it on the request list, and signals the Orangefs file_operations.poll
+function through a wait queue. Userspace is polling the pseudo-device
+and thus becomes aware of the upcall request that needs to be read.
+
+When the Orangefs file_operations.read function is triggered, the
+request list is searched for an op that seems ready-to-process.
+The op is removed from the request list. The tag from the op and
+the filled-out upcall struct are copy_to_user'ed back to userspace.
+
+If any of these (and some additional protocol) copy_to_user calls fail,
+the op's state is set to "waiting" and the op is added back to
+the request list. Otherwise, the op's state is changed to "in progress",
+and the op is hashed on its tag and put onto the end of a list in the
+in_progress hash table at the index the tag hashed to.
+
+When userspace has assembled the response to the upcall, it
+writes the response, which includes the distinguishing tag, back to
+the pseudo device in a series of io_vecs. This triggers the Orangefs
+file_operations.write_iter function to find the op with the associated
+tag and remove it from the in_progress hash table. As long as the op's
+state is not "canceled" or "given up", its state is set to "serviced".
+The file_operations.write_iter function returns to the waiting vfs,
+and back to service_operation through wait_for_matching_downcall.
+
+service_operation returns to its caller with the op's downcall
+part (the response to the upcall) filled out.
+
+The "client-core" is the bridge between the kernel module and
+userspace. The client-core is a daemon. The client-core has an
+associated watchdog daemon. If the client-core is ever signaled
+to die, the watchdog daemon restarts the client-core. Even though
+the client-core is restarted "right away", there is a period of
+time during such an event that the client-core is dead. A dead client-core
+can't be triggered by the Orangefs file_operations.poll function.
+Ops that pass through service_operation during a "dead spell" can time out
+on the wait queue and one attempt is made to recycle them. Obviously,
+if the client-core stays dead too long, the arbitrary userspace processes
+trying to use Orangefs will be negatively affected. Waiting ops
+that can't be serviced will be removed from the request list and
+have their states set to "given up". In-progress ops that can't
+be serviced will be removed from the in_progress hash table and
+have their states set to "given up".
+
+Readdir and I/O ops are atypical with respect to their payloads.
 
   - readdir ops use the smaller of the two pre-allocated pre-partitioned
     memory buffers. The readdir buffer is only available to userspace.
@@ -311,7 +364,7 @@ particular response.
     jamb everything needed to represent a pvfs2_readdir_response_t into
     the readdir buffer descriptor specified in the upcall.
 
-writev() on /dev/pvfs2-req is used to pass responses to the requests
+Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
 made by the kernel side.
 
 A buffer_list containing:
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -404,8 +404,8 @@ static ssize_t orangefs_devreq_write_iter(struct kiocb *iocb,
 
 wakeup:
 	/*
-	 * tell the vfs op waiting on a waitqueue
-	 * that this op is done
+	 * Return to vfs waitqueue, and back to service_operation
+	 * through wait_for_matching_downcall.
 	 */
 	spin_lock(&op->lock);
 	if (unlikely(op_is_cancel(op))) {