ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, Bioinformatics.

Orangefs, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

Orangefs features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless

MAILING LIST ARCHIVES
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/

MAILING LIST SUBMISSIONS
========================

devel@lists.orangefs.org

DOCUMENTATION
=============

http://www.orangefs.org/documentation/

USERSPACE FILESYSTEM SOURCE
===========================

http://www.orangefs.org/download

Orangefs versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.

RUNNING ORANGEFS ON A SINGLE SERVER
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server.

  dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab. It is a
single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.

Create the filesystem.

  pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server.

  systemctl start orangefs-server

Test the server.

  pvfs2-ping -m /pvfsmnt

Start the client. The module must be compiled in or loaded before this
point.

  systemctl start orangefs-client

Mount the filesystem.

  mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

BUILDING ORANGEFS ON A SINGLE SERVER
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

  ./configure --prefix=/opt/ofs --with-db-backend=lmdb
  make
  make install

Create an orangefs config file.

  /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file.

  echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
    /etc/pvfs2tab

Create the mount point you specified in the tab file if needed.

  mkdir /pvfsmnt

Bootstrap the server.

  /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server.

  /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. Pvfs2-ls is a simple
test to verify that the server is running.

  /opt/ofs/bin/pvfs2-ls /pvfsmnt

If stuff seems to be working, load the kernel module and
turn on the client core.

  /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem.

  mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

RUNNING XFSTESTS
================

It is useful to use a scratch filesystem with xfstests. This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

  pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config.

  TEST_DIR=/orangefs
  TEST_DEV=tcp://localhost:3334/orangefs
  SCRATCH_MNT=/scratch
  SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run.

  ./check -pvfs2

OPTIONS
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable posix locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. Posix
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.
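
For example, assuming the single-server setup described earlier in this
document, a mount with ACLs and local POSIX locking enabled might look
like this:

  mount -t pvfs2 -o acl,local_lock tcp://localhost:3334/orangefs /pvfsmnt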

DEBUGGING
=========

If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) go to syslog:

  echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default):

  echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files:

  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging:

  echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords:

  cat /sys/kernel/debug/orangefs/debug-help

PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE
============================================

Orangefs is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of Orangefs as "userspace"
from here on out. Orangefs descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and The Linux
Coding Style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.
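
As a rough illustration of the userspace side of that exchange, the
sketch below opens the pseudo device and waits for something to read.
The device node (/dev/pvfs2-req, described later in this document) is
real, but the buffer size is a placeholder; the real client-core reads a
fixed-layout upcall structure rather than raw bytes, and it also uses
the ioctl interface, which is not shown here.

  /*
   * Illustration only: wait for an upcall on the pseudo device and
   * read it.  BUF_SIZE is a placeholder, not the real upcall size.
   */
  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <unistd.h>

  #define BUF_SIZE 4096

  int main(void)
  {
      char buf[BUF_SIZE];
      struct pollfd pfd;
      ssize_t n;

      pfd.fd = open("/dev/pvfs2-req", O_RDWR);
      if (pfd.fd < 0) {
          perror("open");
          return 1;
      }
      pfd.events = POLLIN;

      for (;;) {
          /* Block until the kernel module signals a pending upcall. */
          if (poll(&pfd, 1, -1) < 0)
              break;
          if (pfd.revents & POLLIN) {
              n = read(pfd.fd, buf, sizeof(buf));
              if (n > 0)
                  printf("read %zd bytes of upcall data\n", n);
          }
      }
      close(pfd.fd);
      return 0;
  }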

THE BUFMAP:

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.
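
A rough userspace-side model of that allocation is sketched below, with
the descriptor reduced to just the fields the text mentions (a pointer,
the total size, and the size and count of the partitions). The real
ORANGEFS_dev_map_desc definition lives in the OrangeFS sources and may
differ in detail.

  /*
   * Simplified model of how userspace might set up the IO buffer and
   * its descriptor.  Field names mirror the description above, not the
   * real OrangeFS definition.
   */
  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  struct dev_map_desc_model {
      void    *ptr;         /* start of the mlocked buffer */
      int32_t  total_size;  /* whole buffer: size * count */
      int32_t  size;        /* one partition */
      int32_t  count;       /* number of partitions */
  };

  int setup_io_buffer(struct dev_map_desc_model *desc)
  {
      size_t part_size = 4194304;   /* 4 MB partitions */
      size_t part_count = 10;       /* 10 partitions -> 41943040 bytes */
      size_t total = part_size * part_count;
      void *buf;

      /* Page-size-aligned allocation, then pin it in memory. */
      if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), total))
          return -1;
      if (mlock(buf, total)) {
          free(buf);
          return -1;
      }

      desc->ptr = buf;
      desc->total_size = (int32_t)total;
      desc->size = (int32_t)part_size;
      desc->count = (int32_t)part_count;
      return 0;
  }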

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl. The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
      partition size, which represents the filesystem's block size and
      is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
      partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to page_count * (sizeof(struct page*)) bytes
      of kcalloced memory. This memory is used as an array of pointers
      to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc))
      bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
          .
          .
          .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) + (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
      indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
      int array used to indicate which of the readdir buffer's partitions are
      available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
      update.
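
The following plain-C model restates the fields and the desc_array
initialization just listed. It is illustrative only; the real
struct orangefs_bufmap uses kernel types (struct page, spinlock_t) and
lives in the kernel module sources.

  #include <stdint.h>

  struct page;  /* stand-in for the kernel's struct page */

  struct bufmap_desc_model {
      void         *uaddr;       /* userspace address of this partition */
      struct page **page_array;  /* pages backing this partition */
      int           array_count; /* pages per partition (1024) */
  };

  struct bufmap_model {
      int           refcnt;      /* reference counter */
      int           desc_size;   /* 4194304: partition/block size */
      int           desc_count;  /* 10 partitions in the IO buffer */
      int           desc_shift;  /* log2(desc_size) */
      int           total_size;  /* desc_size * desc_count */
      int           page_count;  /* 4096-byte pages in the IO buffer */
      struct page **page_array;  /* one pointer per pinned page */
      struct bufmap_desc_model *desc_array;          /* one per partition */
      int          *buffer_index_array;   /* free/busy flag per IO partition */
      int          *readdir_index_array;  /* free/busy flag per readdir partition */
      /* the kernel protects the two index arrays with spinlocks */
  };

  /* Initialization of desc_array, as sketched above. */
  void init_desc_array(struct bufmap_model *bufmap, void *user_ptr)
  {
      int pages_per_desc = bufmap->desc_size / 4096;
      int offset = 0;
      int i;

      for (i = 0; i < bufmap->desc_count; i++) {
          bufmap->desc_array[i].page_array = &bufmap->page_array[offset];
          bufmap->desc_array[i].array_count = pages_per_desc;
          bufmap->desc_array[i].uaddr =
              (char *)user_ptr + (long)i * pages_per_desc * 4096;
          offset += pages_per_desc;
      }
  }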

OPERATIONS:

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

  * unknown  - op was just initialized
  * waiting  - op is on request_list (upward bound)
  * inprogr  - op is in progress (waiting for downcall)
  * serviced - op has matching downcall; ok
  * purged   - op has to start a timer since client-core
               exited uncleanly before servicing op
  * given up - submitter has given up waiting for it
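
An illustrative enum for that lifecycle is shown below; the identifiers
here are made up for clarity, and the kernel module defines its own
names for the same states.

  /* Illustrative names for the op states listed above. */
  enum op_state_model {
      OP_STATE_UNKNOWN,   /* op was just initialized */
      OP_STATE_WAITING,   /* on the request list, upward bound */
      OP_STATE_INPROGR,   /* upcall handed to userspace, awaiting downcall */
      OP_STATE_SERVICED,  /* matching downcall arrived */
      OP_STATE_PURGED,    /* client-core exited uncleanly before servicing */
      OP_STATE_GIVEN_UP   /* submitter has given up waiting */
  };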

When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

Service_operation changes the op's state to "waiting", puts
it on the request list, and signals the Orangefs file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the Orangefs file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the Orangefs
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

Service operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.

The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the Orangefs file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can timeout
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use Orangefs will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

  - int32_t type - type of operation.
  - int32_t status - return code for the operation.
  - int64_t trailer_size - 0 unless readdir operation.
  - char *trailer_buf - initialized to NULL, used during readdir operations.
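
A plain-C model of that container is sketched below. The two response
structs are placeholders standing in for the real userspace typedefs
named in the list that follows; the actual pvfs2_downcall_t is defined
in the OrangeFS sources.

  /* Illustrative model of pvfs2_downcall_t. */
  #include <stdint.h>

  struct io_response_model     { int64_t placeholder; };
  struct statfs_response_model { int64_t placeholder; };
  /* ... one struct per response type ... */

  struct downcall_model {
      int32_t  type;          /* which operation this answers */
      int32_t  status;        /* return code for the operation */
      int64_t  trailer_size;  /* 0 unless this is a readdir response */
      char    *trailer_buf;   /* NULL unless this is a readdir response */
      union {
          struct io_response_model     io;
          struct statfs_response_model statfs;
          /* ... */
      } resp;
  };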

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses:

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this:

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request
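
Condensed into a sketch, that writev might look roughly like the
following. The protocol version, magic value, and downcall layout are
placeholders rather than the real definitions, and the real
PINT_dev_write_list also handles the remaining protocol elements and
error paths.

  #include <stdint.h>
  #include <sys/uio.h>

  struct downcall_placeholder { int32_t type; int32_t status; };

  int write_response(int devfd, int64_t tag,
                     struct downcall_placeholder *downcall,
                     char *trailer_buf, int64_t trailer_size)
  {
      int32_t proto_ver = 6;    /* placeholder protocol version */
      int32_t pdev_magic = 0;   /* placeholder device magic */
      struct iovec io_array[10];
      int count = 4;

      io_array[0].iov_base = &proto_ver;
      io_array[0].iov_len  = sizeof(int32_t);
      io_array[1].iov_base = &pdev_magic;
      io_array[1].iov_len  = sizeof(int32_t);
      io_array[2].iov_base = &tag;
      io_array[2].iov_len  = sizeof(int64_t);
      io_array[3].iov_base = downcall;
      io_array[3].iov_len  = sizeof(*downcall);

      /* Readdir responses append the trailer as a fifth element. */
      if (trailer_size > 0) {
          io_array[4].iov_base = trailer_buf;
          io_array[4].iov_len  = trailer_size;
          count = 5;
      }

      return writev(devfd, io_array, count) < 0 ? -1 : 0;
  }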

Orangefs exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
help it decide whether or not to update an inode: "new" and "bypass".
Orangefs keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any iteration of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0) orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. Getattr_time is updated each time the
inode is updated.

Creation of a new object (file, dir, sym-link) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". Orangefs
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified Orangefs stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. Orangefs is a network filesystem, and objects
can potentially change out-of-band with any particular Orangefs kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem:

"In general, if the clock may have wrapped around more than once, there
is no way to tell how much time has elapsed. However, if the times t1
and t2 are known to be fairly close, we can reliably compute the
difference in a way that takes into account the possibility that the
clock may have wrapped between times."

                      from course notes by instructor Andy Wang
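
A minimal user-space model of such a wrap-safe comparison is shown
below; the kernel code itself would use the time_before()/time_after()
macros rather than open-coding the subtraction.

  #include <stdbool.h>

  /*
   * Wrap-safe check of a jiffy-style timeout such as d_time or
   * getattr_time.  The cast to a signed type keeps the comparison
   * correct even if the counter wrapped once between the two samples.
   */
  bool timeout_expired(unsigned long now, unsigned long expires)
  {
      return (long)(now - expires) >= 0;
  }

Assuming the stored value holds the expiry point, a cached dentry or set
of inode attributes can be reused as long as a check like this has not
yet reported the timeout as expired.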