intel_rdt_ui.txt
User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits:
RDT (Resource Director Technology) Allocation - "rdt_a"
CAT (Cache Allocation Technology) - "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) - "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) - "mba"
To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
"cdpl2": Enable code/data prioritization in L2 cache allocations.
"mba_MBps": Enable the MBA Software Controller (mba_sc) to specify MBA
bandwidth in MBps.

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. The mount
succeeds if either allocation or monitoring is present, but only
those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.
Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":		The number of CLOSIDs which are valid for this
			resource. The kernel uses the smallest number of
			CLOSIDs of all enabled resources as limit.

"cbm_mask":		The bitmask which is valid for this resource.
			This mask is equivalent to 100%.

"min_cbm_bits":		The minimum number of consecutive bits which
			must be set when writing a mask.

"shareable_bits":	Bitmask of shareable resource with other executing
			entities (e.g. I/O). User can use this when
			setting up exclusive cache partitions. Note that
			some platforms support devices that have their
			own settings for cache use which can over-ride
			these bits.

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":	The minimum memory bandwidth percentage which
			user can request.

"bandwidth_gran":	The granularity in which the memory bandwidth
			percentage is allocated. The allocated
			b/w percentage is rounded off to the next
			control step available on the hardware. The
			available bandwidth control steps are:
			min_bandwidth + N * bandwidth_gran.

"delay_linear":		Indicates if the delay scale is linear or
			non-linear. This field is purely informational.
If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":		The number of RMIDs available. This is the
			upper bound for how many "CTRL_MON" + "MON"
			groups can be created.

"mon_features":		Lists the monitoring events if
			monitoring is enabled for the resource.

"max_threshold_occupancy":
			Read/write file provides the largest value (in
			bytes) at which a previously used LLC_occupancy
			counter can be considered for re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.

	# echo L3:0=f7 > schemata
	bash: echo: write error: Invalid argument
	# cat info/last_cmd_status
	mask f7 has non-consecutive 1-bits
Resource alloc and monitor groups
---------------------------------

Resource groups are represented as directories in the resctrl file
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
	Reading this file shows the list of all tasks that belong to
	this group. Writing a task id to the file will add a task to the
	group. If the group is a CTRL_MON group the task is removed from
	whichever previous CTRL_MON group owned the task and also from
	any MON group that owned the task. If the group is a MON group,
	then the task must already belong to the CTRL_MON parent of this
	group. The task is removed from any previous MON group.

"cpus":
	Reading this file shows a bitmask of the logical CPUs owned by
	this group. Writing a mask to this file will add and remove
	CPUs to/from this group. As with the tasks file a hierarchy is
	maintained where MON groups may only include CPUs owned by the
	parent CTRL_MON group.

"cpus_list":
	Just like "cpus", only using ranges of CPUs instead of bitmasks.

When control is enabled all CTRL_MON groups will also contain:

"schemata":
	A list of all the resources available to this group.
	Each resource has its own line and format - see below for details.

When monitoring is enabled all MON groups will also contain:

"mon_data":
	This contains a set of files organized by L3 domain and by
	RDT event. E.g. on a system with two L3 domains there will
	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
	directories has one file per event (e.g. "llc_occupancy",
	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
	files provide a read out of the current value of the event for
	all tasks in the group. In CTRL_MON groups these files provide
	the sum for all tasks in the CTRL_MON group and all tasks in
	MON groups. Please see example section for more details on usage.
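The per-domain layout lends itself to simple aggregation from the shell. The sketch below sums one event across all L3 domains of a group; it runs against a mock directory tree (a hypothetical stand-in for a mounted resctrl filesystem, with made-up values) so the loop can be shown without hardware:

```shell
# sum_event <group dir> <event> - add up one event file across all
# mon_L3_* domain directories under the group's mon_data directory.
sum_event() {
	total=0
	for f in "$1"/mon_data/mon_L3_*/"$2"; do
		total=$((total + $(cat "$f")))
	done
	echo "$total"
}

# Demonstrate against a mock tree instead of /sys/fs/resctrl.
d=$(mktemp -d)
mkdir -p "$d/mon_data/mon_L3_00" "$d/mon_data/mon_L3_01"
echo 16234000 > "$d/mon_data/mon_L3_00/llc_occupancy"
echo 14789000 > "$d/mon_data/mon_L3_01/llc_occupancy"
sum_event "$d" llc_occupancy      # prints 31023000
rm -r "$d"
```

On a real mount the same loop works with the group's directory in place of the mock tree.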
Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.
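The three rules amount to a simple fallback chain, sketched here as a shell function (the helper and the "default" marker for root-group membership are ours, purely for illustration):

```shell
# effective_group <task group> <cpu group> - which group's schemata
# applies to a running task, per rules 1-3 above.
# "default" stands for membership in the default (root) group.
effective_group() {
	if [ "$1" != default ]; then
		echo "$1"            # rule 1: the task's own group wins
	elif [ "$2" != default ]; then
		echo "$2"            # rule 2: the CPU's group
	else
		echo default         # rule 3: fall back to the default group
	fi
}

effective_group p0 default      # prints p0
effective_group default p1      # prints p1
```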
Resource monitoring rules
-------------------------

1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.
Notes on cache occupancy monitoring and control
-----------------------------------------------

When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSid (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based
on the kind of group. The number of CLOSids and RMIDs is limited by the
hardware and hence the creation of a "CTRL_MON" directory may fail if
we run out of either CLOSIDs or RMIDs, and creation of a "MON" group
may fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of the previous user of the
RMID. Hence such RMIDs are placed on a limbo list and checked back in
once their cache occupancy has gone down. If at some point the system
has a lot of limbo RMIDs but none that are ready to be used, the user
may see an -EBUSY during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
Schemata files - general concepts
---------------------------------

Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------

On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

Cache Bit Masks (CBM)
---------------------

For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
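A candidate mask can be validated before writing it: a non-zero value has a single contiguous run of '1' bits iff filling in the low bits below its lowest set bit, then adding one, clears every bit of the original. A small sketch (the helper name is ours):

```shell
# is_contiguous <hex mask> - succeed if the mask is non-zero and its
# set bits form one contiguous block, as x86 requires.
is_contiguous() {
	x=$((0x$1))
	[ "$x" -ne 0 ] && [ $(( ((x | (x - 1)) + 1) & x )) -eq 0 ]
}

is_contiguous 3c0 && echo "3c0 ok"       # contiguous block
is_contiguous f7 || echo "f7 rejected"   # 11110111b has a hole
```

This is the same check the kernel applies; a mask that fails it draws the "non-consecutive 1-bits" error shown in the last_cmd_status example above.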
Memory bandwidth Allocation and monitoring
------------------------------------------

For the Memory bandwidth resource, by default the user controls the
resource by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
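Given min_bandwidth and bandwidth_gran from the info directory, the effective value can be predicted before writing a schemata. A sketch of that rounding, under the assumption (from the text above) that intermediate values round up to the next step (the helper name is ours):

```shell
# round_bw <requested %> <min_bandwidth> <bandwidth_gran> - round a
# requested percentage up to the next available control step
# min_bandwidth + N * bandwidth_gran.
round_bw() {
	req=$1; min=$2; gran=$3
	if [ "$req" -le "$min" ]; then
		echo "$min"
		return
	fi
	n=$(( (req - min + gran - 1) / gran ))
	echo $(( min + n * gran ))
}

round_bw 35 10 10     # prints 40 (steps are 10, 20, 30, 40, ...)
round_bw 50 10 10     # prints 50 (already on a step)
```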
The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth. The fact that Memory bandwidth allocation (MBA) is a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on the number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
thread, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is
the same.
In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure:

	"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
the mount option 'mba_MBps'. The schemata format is specified in the
sections below.
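The feedback the software controller works from is just the delta of the MBM byte counters over a sampling interval. A minimal sketch of that calculation (the helper name and the sample values are ours; the kernel's actual sampling interval and adjustment policy are internal details):

```shell
# bw_MBps <bytes before> <bytes after> <interval in seconds> - the
# observed bandwidth in MBps, as derived from two successive reads
# of an mbm_total_bytes counter.
bw_MBps() {
	echo $(( ($2 - $1) / $3 / 1000000 ))
}

bw_MBps 0 2000000000 2     # prints 1000: 2 GB moved in 2s = 1000 MBps
```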
L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------

With CDP disabled the L3 schemata format is:

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------

When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

	L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------

L2 cache does not support code and data prioritization, so the
schemata format is always:

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.

	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------

Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
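Since every line is just resource:id=val;id=val;..., extracting a single domain's value from a schemata line is a short text-processing exercise. A sketch (the helper name is ours):

```shell
# get_domain <schemata line> <cache id> - print the value for one
# cache id from a "RES:id=val;id=val;..." line.
get_domain() {
	echo "$1" | sed 's/^[^:]*://' | tr ';' '\n' |
		awk -F= -v id="$2" '$1 == id { print $2 }'
}

get_domain "L3DATA:0=fffff;1=fffff;2=3c0;3=fffff" 2     # prints 3c0
```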
Examples for RDT allocation usage:

Example 1
---------

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specify the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If MBA is specified in MB (megabytes) then the user can enter the max
b/w in MB rather than the percentage values.

# echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a
max b/w of 1024MB whereas on socket 1 they would use 500MB.
Example 2
---------

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assume min_bandwidth is 10 and bandwidth_gran
is 10):

For our first real time task this would request 20% memory b/w on
socket 0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
Example 3
---------

A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.

# echo f0 > p0/cpus
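The mask written above covers cores 4-7. For larger ranges it is easier to compute the hex mask than to write it by hand; a small sketch (the helper name is ours, and it only handles ranges that fit in one machine word):

```shell
# cpu_range_mask <first cpu> <last cpu> - hex bitmask covering an
# inclusive range of CPUs, suitable for writing to a "cpus" file.
cpu_range_mask() {
	printf '%x\n' $(( ((1 << ($2 - $1 + 1)) - 1) << $1 ))
}

cpu_range_mask 4 7      # prints f0
cpu_range_mask 0 3      # prints f
```

Alternatively the "cpus_list" file accepts the range directly, e.g. "echo 4-7 > p0/cpus_list".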
Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
mask = function-of(output.txt)
mkdir /sys/fs/resctrl/newres/
echo mask > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh
Example with C:

/*
 * Example code to take advisory locks
 * before accessing the resctrl filesystem
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	fd = open("/sys/fs/resctrl", O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);

	return 0;
}
Examples for RDT Monitoring along with allocation usage:

Reading monitored data
----------------------

Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.

Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.

# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks

Fetch data (data shown in bytes)

# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000

The parent CTRL_MON group shows the aggregated data.

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000
Example 2 (Monitor a task from its creation)
--------------------------------------------

On a two socket machine (one L3 cache per socket)

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.

# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>

Fetch the data

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000
Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW has only CQM and no CAT support. In this case
the resctrl will still mount but cannot create CTRL_MON directories.
But the user can create different MON groups within the root group and
thereby monitor all tasks including kernel threads.

This can also be used to profile jobs' cache size footprint before they
are allocated to different allocation groups.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02

# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain(socket) 0.

# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789
Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1

Move the cpus 4-7 over to p1

# echo f0 > p1/cpus

View the llc occupancy snapshot

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000