@@ -1,7 +1,9 @@
-
+================
 Control Group v2
+================
 
-October, 2015 Tejun Heo <tj@kernel.org>
+:Date: October, 2015
+:Author: Tejun Heo <tj@kernel.org>
 
 This is the authoritative documentation on the design, interface and
 conventions of cgroup v2. It describes all userland-visible aspects
@@ -9,70 +11,72 @@ of cgroup including core and specific controller behaviors. All
 future changes must be reflected in this document. Documentation for
 v1 is available under Documentation/cgroup-v1/.
 
-CONTENTS
-
-1. Introduction
-  1-1. Terminology
-  1-2. What is cgroup?
-2. Basic Operations
-  2-1. Mounting
-  2-2. Organizing Processes
-  2-3. [Un]populated Notification
-  2-4. Controlling Controllers
-    2-4-1. Enabling and Disabling
-    2-4-2. Top-down Constraint
-    2-4-3. No Internal Process Constraint
-  2-5. Delegation
-    2-5-1. Model of Delegation
-    2-5-2. Delegation Containment
-  2-6. Guidelines
-    2-6-1. Organize Once and Control
-    2-6-2. Avoid Name Collisions
-3. Resource Distribution Models
-  3-1. Weights
-  3-2. Limits
-  3-3. Protections
-  3-4. Allocations
-4. Interface Files
-  4-1. Format
-  4-2. Conventions
-  4-3. Core Interface Files
-5. Controllers
-  5-1. CPU
-    5-1-1. CPU Interface Files
-  5-2. Memory
-    5-2-1. Memory Interface Files
-    5-2-2. Usage Guidelines
-    5-2-3. Memory Ownership
-  5-3. IO
-    5-3-1. IO Interface Files
-    5-3-2. Writeback
-  5-4. PID
-    5-4-1. PID Interface Files
-  5-5. RDMA
-    5-5-1. RDMA Interface Files
-  5-6. Misc
-    5-6-1. perf_event
-6. Namespace
-  6-1. Basics
-  6-2. The Root and Views
-  6-3. Migration and setns(2)
-  6-4. Interaction with Other Namespaces
-P. Information on Kernel Programming
-  P-1. Filesystem Support for Writeback
-D. Deprecated v1 Core Features
-R. Issues with v1 and Rationales for v2
-  R-1. Multiple Hierarchies
-  R-2. Thread Granularity
-  R-3. Competition Between Inner Nodes and Threads
-  R-4. Other Interface Issues
-  R-5. Controller Issues and Remedies
-    R-5-1. Memory
-
-
-1. Introduction
-
-1-1. Terminology
+.. CONTENTS
+
+   1. Introduction
+     1-1. Terminology
+     1-2. What is cgroup?
+   2. Basic Operations
+     2-1. Mounting
+     2-2. Organizing Processes
+     2-3. [Un]populated Notification
+     2-4. Controlling Controllers
+       2-4-1. Enabling and Disabling
+       2-4-2. Top-down Constraint
+       2-4-3. No Internal Process Constraint
+     2-5. Delegation
+       2-5-1. Model of Delegation
+       2-5-2. Delegation Containment
+     2-6. Guidelines
+       2-6-1. Organize Once and Control
+       2-6-2. Avoid Name Collisions
+   3. Resource Distribution Models
+     3-1. Weights
+     3-2. Limits
+     3-3. Protections
+     3-4. Allocations
+   4. Interface Files
+     4-1. Format
+     4-2. Conventions
+     4-3. Core Interface Files
+   5. Controllers
+     5-1. CPU
+       5-1-1. CPU Interface Files
+     5-2. Memory
+       5-2-1. Memory Interface Files
+       5-2-2. Usage Guidelines
+       5-2-3. Memory Ownership
+     5-3. IO
+       5-3-1. IO Interface Files
+       5-3-2. Writeback
+     5-4. PID
+       5-4-1. PID Interface Files
+     5-5. RDMA
+       5-5-1. RDMA Interface Files
+     5-6. Misc
+       5-6-1. perf_event
+   6. Namespace
+     6-1. Basics
+     6-2. The Root and Views
+     6-3. Migration and setns(2)
+     6-4. Interaction with Other Namespaces
+   P. Information on Kernel Programming
+     P-1. Filesystem Support for Writeback
+   D. Deprecated v1 Core Features
+   R. Issues with v1 and Rationales for v2
+     R-1. Multiple Hierarchies
+     R-2. Thread Granularity
+     R-3. Competition Between Inner Nodes and Threads
+     R-4. Other Interface Issues
+     R-5. Controller Issues and Remedies
+       R-5-1. Memory
+
+
+Introduction
+============
+
+Terminology
+-----------
 
 "cgroup" stands for "control group" and is never capitalized. The
 singular form is used to designate the whole feature and also as a
@@ -80,7 +84,8 @@ qualifier as in "cgroup controllers". When explicitly referring to
 multiple individual control groups, the plural form "cgroups" is used.
 
 
-1-2. What is cgroup?
+What is cgroup?
+---------------
 
 cgroup is a mechanism to organize processes hierarchically and
 distribute system resources along the hierarchy in a controlled and
@@ -110,12 +115,14 @@ restrictions set closer to the root in the hierarchy can not be
 overridden from further away.
 
 
-2. Basic Operations
+Basic Operations
+================
 
-2-1. Mounting
+Mounting
+--------
 
 Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2
-hierarchy can be mounted with the following mount command.
+hierarchy can be mounted with the following mount command::
 
   # mount -t cgroup2 none $MOUNT_POINT
 
@@ -160,10 +167,11 @@ cgroup v2 currently supports the following mount options.
         Delegation section for details.
 
 
-2-2. Organizing Processes
+Organizing Processes
+--------------------
 
 Initially, only the root cgroup exists to which all processes belong.
-A child cgroup can be created by creating a sub-directory.
+A child cgroup can be created by creating a sub-directory::
 
   # mkdir $CGROUP_NAME
 
@@ -190,28 +198,29 @@ moved to another cgroup.
 A cgroup which doesn't have any children or live processes can be
 destroyed by removing the directory. Note that a cgroup which doesn't
 have any children and is associated only with zombie processes is
-considered empty and can be removed.
+considered empty and can be removed::
 
   # rmdir $CGROUP_NAME
 
 "/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
 cgroup is in use in the system, this file may contain multiple lines,
 one for each hierarchy. The entry for cgroup v2 is always in the
-format "0::$PATH".
+format "0::$PATH"::
 
   # cat /proc/842/cgroup
   ...
   0::/test-cgroup/test-cgroup-nested
 
 If the process becomes a zombie and the cgroup it was associated with
-is removed subsequently, " (deleted)" is appended to the path.
+is removed subsequently, " (deleted)" is appended to the path::
 
   # cat /proc/842/cgroup
   ...
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
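The "0::$PATH" entry and the " (deleted)" suffix described above can be picked out of a "/proc/$PID/cgroup" dump with a few lines of Python. This is only a sketch for illustration and not part of the patch. The function name is hypothetical, and it assumes that a trailing " (deleted)" always marks a removed cgroup, which would misfire on a cgroup whose own name ends in that string.

```python
def parse_cgroup_v2_entry(text):
    """Return (path, deleted) for the cgroup v2 line, or None if absent."""
    suffix = " (deleted)"
    for line in text.splitlines():
        # The cgroup v2 entry is always in the format "0::$PATH".
        if line.startswith("0::"):
            path = line[len("0::"):]
            if path.endswith(suffix):
                # Strip the marker appended for a removed cgroup.
                return path[:-len(suffix)], True
            return path, False
    return None


sample = "0::/test-cgroup/test-cgroup-nested (deleted)\n"
print(parse_cgroup_v2_entry(sample))
```

On a live system the text would come from reading "/proc/$PID/cgroup"; a literal sample is used here so the sketch runs anywhere.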
-2-3. [Un]populated Notification
+[Un]populated Notification
+--------------------------
 
 Each non-root cgroup has a "cgroup.events" file which contains
 "populated" field indicating whether the cgroup's sub-hierarchy has
@@ -222,7 +231,7 @@ example, to start a clean-up operation after all processes of a given
 sub-hierarchy have exited. The populated state updates and
 notifications are recursive. Consider the following sub-hierarchy
 where the numbers in the parentheses represent the numbers of processes
-in each cgroup.
+in each cgroup::
 
   A(4) - B(0) - C(1)
               \ D(0)
@@ -233,18 +242,20 @@ file modified events will be generated on the "cgroup.events" files of
 both cgroups.
 
 
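The recursive "populated" state can be modelled in a few lines of Python. The sketch below is illustrative only and not part of the patch. It hard-codes the A(4) - B(0) - C(1) / D(0) sub-hierarchy as plain dictionaries (hypothetical names) and reports which cgroups change state when the single process in C exits.

```python
def populated(tree, procs, name):
    """Return 1 if `name` or any of its descendants has a live process."""
    if procs[name] > 0:
        return 1
    return int(any(populated(tree, procs, child) for child in tree[name]))


# Children of each cgroup and the number of processes each one holds.
tree = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}
procs = {"A": 4, "B": 0, "C": 1, "D": 0}

before = {name: populated(tree, procs, name) for name in tree}
procs["C"] = 0  # the one process in C exits
after = {name: populated(tree, procs, name) for name in tree}

# cgroups whose "populated" value flipped, i.e. where an event would fire
changed = sorted(name for name in tree if before[name] != after[name])
print(changed)  # ['B', 'C']
```

A stays populated because it still holds processes of its own, so only B and C change.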
-2-4. Controlling Controllers
+Controlling Controllers
+-----------------------
 
-2-4-1. Enabling and Disabling
+Enabling and Disabling
+~~~~~~~~~~~~~~~~~~~~~~
 
 Each cgroup has a "cgroup.controllers" file which lists all
-controllers available for the cgroup to enable.
+controllers available for the cgroup to enable::
 
   # cat cgroup.controllers
   cpu io memory
 
 No controller is enabled by default. Controllers can be enabled and
-disabled by writing to the "cgroup.subtree_control" file.
+disabled by writing to the "cgroup.subtree_control" file::
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
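The effect of a write such as "+cpu +memory -io" can be modelled as a set update. The Python sketch below is illustrative only and not part of the patch. The function name is hypothetical, and it covers the string semantics alone: it does not check "cgroup.controllers" for availability or reproduce any kernel-side validation.

```python
def apply_subtree_control(enabled, request):
    """Return the enabled-controller set after applying a '+x -y' request."""
    ops = {}
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError(f"malformed token: {token!r}")
        # A later operation on the same controller replaces an earlier one.
        ops[name] = sign
    result = set(enabled)
    for name, sign in ops.items():
        if sign == "+":
            result.add(name)
        else:
            result.discard(name)
    return result


print(sorted(apply_subtree_control({"io"}, "+cpu +memory -io")))  # ['cpu', 'memory']
```

Collecting the operations into a dictionary first is what makes the last operation on a given controller the effective one.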
@@ -256,7 +267,7 @@ are specified, the last one is effective.
 Enabling a controller in a cgroup indicates that the distribution of
 the target resource across its immediate children will be controlled.
 Consider the following sub-hierarchy. The enabled controllers are
-listed in parentheses.
+listed in parentheses::
 
   A(cpu,memory) - B(memory) - C()
                             \ D()
@@ -276,7 +287,8 @@ controller interface files - anything which doesn't start with
 "cgroup." are owned by the parent rather than the cgroup itself.
 
 
-2-4-2. Top-down Constraint
+Top-down Constraint
+~~~~~~~~~~~~~~~~~~~
 
 Resources are distributed top-down and a cgroup can further distribute
 a resource only if the resource has been distributed to it from the
@@ -287,7 +299,8 @@ the parent has the controller enabled and a controller can't be
 disabled if one or more children have it enabled.
 
 
-2-4-3. No Internal Process Constraint
+No Internal Process Constraint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Non-root cgroups can only distribute resources to their children when
 they don't have any processes of their own. In other words, only
@@ -314,9 +327,11 @@ children before enabling controllers in its "cgroup.subtree_control"
 file.
 
 
-2-5. Delegation
+Delegation
+----------
 
-2-5-1. Model of Delegation
+Model of Delegation
+~~~~~~~~~~~~~~~~~~~
 
 A cgroup can be delegated in two ways. First, to a less privileged
 user by granting write access of the directory and its "cgroup.procs"
@@ -345,7 +360,8 @@ cgroups in or nesting depth of a delegated sub-hierarchy; however,
 this may be limited explicitly in the future.
 
 
-2-5-2. Delegation Containment
+Delegation Containment
+~~~~~~~~~~~~~~~~~~~~~~
 
 A delegated sub-hierarchy is contained in the sense that processes
 can't be moved into or out of the sub-hierarchy by the delegatee.
@@ -366,7 +382,7 @@ in from or push out to outside the sub-hierarchy.
 
 For an example, let's assume cgroups C0 and C1 have been delegated to
 user U0 who created C00, C01 under C0 and C10 under C1 as follows and
-all processes under C0 and C1 belong to U0.
+all processes under C0 and C1 belong to U0::
 
   ~~~~~~~~~~~~~ - C0 - C00
   ~ cgroup    ~      \ C01
@@ -386,9 +402,11 @@ namespace of the process which is attempting the migration. If either
 is not reachable, the migration is rejected with -ENOENT.
 
 
-2-6. Guidelines
+Guidelines
+----------
 
-2-6-1. Organize Once and Control
+Organize Once and Control
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Migrating a process across cgroups is a relatively expensive operation
 and stateful resources such as memory are not moved together with the
@@ -404,7 +422,8 @@ distribution can be made by changing controller configuration through
 the interface files.
 
 
-2-6-2. Avoid Name Collisions
+Avoid Name Collisions
+~~~~~~~~~~~~~~~~~~~~~
 
 Interface files for a cgroup and its children cgroups occupy the same
 directory and it is possible to create children cgroups which collide
@@ -422,14 +441,16 @@ cgroup doesn't do anything to prevent name collisions and it's the
 user's responsibility to avoid them.
 
 
-3. Resource Distribution Models
+Resource Distribution Models
+============================
 
 cgroup controllers implement several resource distribution schemes
 depending on the resource type and expected use cases. This section
 describes major schemes in use along with their expected behaviors.
 
 
-3-1. Weights
+Weights
+-------
 
 A parent's resource is distributed by adding up the weights of all
 active children and giving each the fraction matching the ratio of its
@@ -450,7 +471,8 @@ process migrations.
 and is an example of this type.
 
 
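The weight model can be illustrated numerically. The Python sketch below is not part of the patch. It assumes that each active child's share is its weight divided by the sum of the weights of all active children, which is how the Weights text reads although part of that sentence lies outside the hunks shown here. The function and cgroup names are hypothetical.

```python
def weight_shares(weights):
    """Map each active child to its fraction of the parent's resource."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Two active children with weights 100 and 300.
shares = weight_shares({"a": 100, "b": 300})
print(shares)  # {'a': 0.25, 'b': 0.75}
```

Because only active children are summed, removing an idle child from the input raises the shares of the remaining ones.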
-3-2. Limits
+Limits
+------
 
 A child can only consume up to the configured amount of the resource.
 Limits can be over-committed - the sum of the limits of children can
@@ -466,7 +488,8 @@ process migrations.
 on an IO device and is an example of this type.
 
 
-3-3. Protections
+Protections
+-----------
 
 A cgroup is protected to be allocated up to the configured amount of
 the resource if the usages of all its ancestors are under their
@@ -486,7 +509,8 @@ process migrations.
 example of this type.
 
 
-3-4. Allocations
+Allocations
+-----------
 
 A cgroup is exclusively allocated a certain amount of a finite
 resource. Allocations can't be over-committed - the sum of the
@@ -505,12 +529,14 @@ may be rejected.
 type.
 
 
-4. Interface Files
+Interface Files
+===============
 
-4-1. Format
+Format
+------
 
 All interface files should be in one of the following formats whenever
-possible.
+possible::
 
   New-line separated values
   (when only one value can be written at once)
@@ -545,7 +571,8 @@ can be written at a time. For nested keyed files, the sub key pairs
 may be specified in any order and not all pairs have to be specified.
 
 
-4-2. Conventions
+Conventions
+-----------
 
 - Settings for a single feature should be contained in a single file.
 
@@ -581,25 +608,25 @@ may be specified in any order and not all pairs have to be specified.
   with "default" as the value must not appear when read.
 
   For example, a setting which is keyed by major:minor device numbers
-  with integer values may look like the following.
+  with integer values may look like the following::
 
     # cat cgroup-example-interface-file
     default 150
     8:0 300
 
-  The default value can be updated by
+  The default value can be updated by::
 
     # echo 125 > cgroup-example-interface-file
 
-  or
+  or::
 
     # echo "default 125" > cgroup-example-interface-file
 
-  An override can be set by
+  An override can be set by::
 
     # echo "8:16 170" > cgroup-example-interface-file
 
-  and cleared by
+  and cleared by::
 
     # echo "8:0 default" > cgroup-example-interface-file
     # cat cgroup-example-interface-file
@@ -612,12 +639,12 @@ may be specified in any order and not all pairs have to be specified.
   generated on the file.
 
 
-4-3. Core Interface Files
+Core Interface Files
+--------------------
 
 All cgroup core files are prefixed with "cgroup."
 
   cgroup.procs
-
         A read-write new-line separated values file which exists on
         all cgroups.
 
@@ -643,7 +670,6 @@ All cgroup core files are prefixed with "cgroup."
         should be granted along with the containing directory.
 
   cgroup.controllers
-
         A read-only space separated values file which exists on all
         cgroups.
 
@@ -651,7 +677,6 @@ All cgroup core files are prefixed with "cgroup."
         the cgroup. The controllers are not ordered.
 
   cgroup.subtree_control
-
         A read-write space separated values file which exists on all
         cgroups. Starts out empty.
 
@@ -667,23 +692,25 @@ All cgroup core files are prefixed with "cgroup."
         operations are specified, either all succeed or all fail.
 
   cgroup.events
-
         A read-only flat-keyed file which exists on non-root cgroups.
         The following entries are defined. Unless specified
         otherwise, a value change in this file generates a file
         modified event.
 
           populated
-
                 1 if the cgroup or its descendants contains any live
                 processes; otherwise, 0.
 
 
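A flat-keyed file such as "cgroup.events" can be read into a dictionary and queried by key. The Python sketch below is illustrative only and not part of the patch. It assumes one "KEY VALUE" pair per line, and the function name is hypothetical.

```python
def parse_flat_keyed(text):
    """Parse 'KEY VALUE' lines into a dict of strings."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split(None, 1)
        result[key] = value
    return result


# Literal sample; on a live system this text would be the file's contents.
events = parse_flat_keyed("populated 1\n")
is_populated = events["populated"] == "1"
print(is_populated)  # True
```

Looking values up by key rather than by position keeps the reader working if further entries are added to the file.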
-5. Controllers
+Controllers
+===========
 
-5-1. CPU
+CPU
+---
 
-[NOTE: The interface for the cpu controller hasn't been merged yet]
+.. note::
+
+   The interface for the cpu controller hasn't been merged yet
 
 The "cpu" controller regulates distribution of CPU cycles. This
 controller implements weight and absolute bandwidth limit models for
@@ -691,36 +718,34 @@ normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
 
-5-1-1. CPU Interface Files
+CPU Interface Files
+~~~~~~~~~~~~~~~~~~~
 
 All time durations are in microseconds.
 
   cpu.stat
-
         A read-only flat-keyed file which exists on non-root cgroups.
 
-        It reports the following six stats.
+        It reports the following six stats:
 
-          usage_usec
-          user_usec
-          system_usec
-          nr_periods
-          nr_throttled
-          throttled_usec
+        - usage_usec
+        - user_usec
+        - system_usec
+        - nr_periods
+        - nr_throttled
+        - throttled_usec
 
   cpu.weight
-
         A read-write single value file which exists on non-root
         cgroups. The default is "100".
 
         The weight in the range [1, 10000].
 
   cpu.max
-
         A read-write two value file which exists on non-root cgroups.
         The default is "max 100000".
 
-        The maximum bandwidth limit. It's in the following format.
+        The maximum bandwidth limit. It's in the following format::
 
           $MAX $PERIOD
 
@@ -729,9 +754,10 @@ All time durations are in microseconds.
         one number is written, $MAX is updated.
 
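The "$MAX $PERIOD" format of "cpu.max" and the single-number write rule can be sketched in Python. This is illustrative only and not part of the patch. The function names are hypothetical, and reading "max" as "no limit" (represented here as None) is an assumption, because the sentence defining it lies outside the hunks shown.

```python
def parse_cpu_max(text):
    """Parse '$MAX $PERIOD' into (max_usec or None, period_usec)."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError(f"expected two values, got {text!r}")
    # "max" is taken to mean no limit (assumption, see lead-in).
    quota = None if fields[0] == "max" else int(fields[0])
    return quota, int(fields[1])


def update_cpu_max(current, written):
    """Apply a write; if only one number is written, $MAX is updated."""
    fields = written.split()
    if len(fields) == 1:
        fields.append(str(current[1]))  # keep the existing $PERIOD
    return parse_cpu_max(" ".join(fields))


state = parse_cpu_max("max 100000")      # the documented default
state = update_cpu_max(state, "50000")   # only $MAX changes
print(state)  # (50000, 100000)
```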
   cpu.rt.max
+        .. note::
 
-  [NOTE: The semantics of this file is still under discussion and the
-   interface hasn't been merged yet]
+           The semantics of this file is still under discussion and the
+           interface hasn't been merged yet
 
         A read-write two value file which exists on all cgroups.
         The default is "0 100000".
@@ -739,7 +765,7 @@ All time durations are in microseconds.
         The maximum realtime runtime allocation. Over-committing
         configurations are disallowed and process migrations are
         rejected if not enough bandwidth is available. It's in the
-        following format.
+        following format::
 
           $MAX $PERIOD
 
@@ -748,7 +774,8 @@ All time durations are in microseconds.
         updated.
 
 
-5-2. Memory
+Memory
+------
 
 The "memory" controller regulates distribution of memory. Memory is
 stateful and implements both limit and protection models. Due to the
@@ -770,14 +797,14 @@ following types of memory usages are tracked.
 The above list may expand in the future for better coverage.
 
 
-5-2-1. Memory Interface Files
+Memory Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
 
 All memory amounts are in bytes. If a value which is not aligned to
 PAGE_SIZE is written, the value may be rounded up to the closest
 PAGE_SIZE multiple when read back.
 
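The rounding described above amounts to the following arithmetic, shown as a Python sketch that is not part of the patch. The 4096-byte page size is an assumption for illustration, since the real PAGE_SIZE depends on the architecture and kernel configuration. The text says the value "may" be rounded, so this shows the largest adjustment a reader should expect rather than a guaranteed result.

```python
def round_up_to_page(value, page_size=4096):
    """Round a byte count up to the closest multiple of page_size."""
    # Ceiling division written with integer arithmetic only.
    return -(-value // page_size) * page_size


print(round_up_to_page(5000))  # 8192
print(round_up_to_page(4096))  # 4096
```

A tool that writes a limit and then compares it with the value read back should therefore compare against the rounded value.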
   memory.current
-
         A read-only single value file which exists on non-root
         cgroups.
 
@@ -785,7 +812,6 @@ PAGE_SIZE multiple when read back.
         and its descendants.
 
   memory.low
-
         A read-write single value file which exists on non-root
         cgroups. The default is "0".
 
@@ -798,7 +824,6 @@ PAGE_SIZE multiple when read back.
         protection is discouraged.
 
   memory.high
-
         A read-write single value file which exists on non-root
         cgroups. The default is "max".
 
@@ -811,7 +836,6 @@ PAGE_SIZE multiple when read back.
         under extreme conditions the limit may be breached.
 
   memory.max
-
         A read-write single value file which exists on non-root
         cgroups. The default is "max".
 
@@ -826,21 +850,18 @@ PAGE_SIZE multiple when read back.
         utility is limited to providing the final safety net.
 
   memory.events
-
         A read-only flat-keyed file which exists on non-root cgroups.
         The following entries are defined. Unless specified
         otherwise, a value change in this file generates a file
         modified event.
 
           low
-
                 The number of times the cgroup is reclaimed due to
                 high memory pressure even though its usage is under
                 the low boundary. This usually indicates that the low
                 boundary is over-committed.
 
           high
-
                 The number of times processes of the cgroup are
                 throttled and routed to perform direct memory reclaim
                 because the high memory boundary was exceeded. For a
@@ -849,13 +870,11 @@ PAGE_SIZE multiple when read back.
                 occurrences are expected.
 
           max
-
                 The number of times the cgroup's memory usage was
                 about to go over the max boundary. If direct reclaim
                 fails to bring it down, the cgroup goes to OOM state.
 
           oom
-
                 The number of times the cgroup's memory usage
                 reached the limit and allocation was about to fail.
 
@@ -864,16 +883,14 @@ PAGE_SIZE multiple when read back.
 
                 Failed allocation in its turn could be returned into
                 userspace as -ENOMEM or silently ignored in cases like
-                disk readahead. For now OOM in memory cgroup kills
+                disk readahead. For now OOM in memory cgroup kills
                 tasks iff shortage has happened inside page fault.
 
           oom_kill
-
                 The number of processes belonging to this cgroup
                 killed by any kind of OOM killer.
 
   memory.stat
-
         A read-only flat-keyed file which exists on non-root cgroups.
 
         This breaks down the cgroup's memory footprint into different
@@ -887,73 +904,55 @@ PAGE_SIZE multiple when read back.
|
|
fixed position; use the keys to look up specific values!
|
|
fixed position; use the keys to look up specific values!
|
|
|
|
|
|
anon
|
|
anon
|
|
-
|
|
|
|
Amount of memory used in anonymous mappings such as
|
|
Amount of memory used in anonymous mappings such as
|
|
brk(), sbrk(), and mmap(MAP_ANONYMOUS)
|
|
brk(), sbrk(), and mmap(MAP_ANONYMOUS)
|
|
|
|
|
|
file
|
|
file
|
|
-
|
|
|
|
Amount of memory used to cache filesystem data,
|
|
Amount of memory used to cache filesystem data,
|
|
including tmpfs and shared memory.
|
|
including tmpfs and shared memory.
|
|
|
|
|
|
kernel_stack
|
|
kernel_stack
|
|
-
|
|
|
|
Amount of memory allocated to kernel stacks.
|
|
Amount of memory allocated to kernel stacks.
|
|
|
|
|
|
slab
|
|
slab
|
|
-
|
|
|
|
Amount of memory used for storing in-kernel data
|
|
Amount of memory used for storing in-kernel data
|
|
structures.
|
|
structures.
|
|
|
|
|
|
sock
|
|
sock
|
|
-
|
|
|
|
Amount of memory used in network transmission buffers
|
|
Amount of memory used in network transmission buffers
|
|
|
|
|
|
shmem
|
|
shmem
|
|
-
|
|
|
|
Amount of cached filesystem data that is swap-backed,
|
|
Amount of cached filesystem data that is swap-backed,
|
|
such as tmpfs, shm segments, shared anonymous mmap()s
|
|
such as tmpfs, shm segments, shared anonymous mmap()s
|
|
|
|
|
|
file_mapped
|
|
file_mapped
|
|
-
|
|
|
|
Amount of cached filesystem data mapped with mmap()
|
|
Amount of cached filesystem data mapped with mmap()
|
|
|
|
|
|
file_dirty
|
|
file_dirty
|
|
-
|
|
|
|
Amount of cached filesystem data that was modified but
|
|
Amount of cached filesystem data that was modified but
|
|
not yet written back to disk
|
|
not yet written back to disk
|
|
|
|
|
|
file_writeback
|
|
file_writeback
|
|
-
|
|
|
|
Amount of cached filesystem data that was modified and
|
|
Amount of cached filesystem data that was modified and
|
|
is currently being written back to disk
|
|
is currently being written back to disk
|
|
|
|
|
|
- inactive_anon
|
|
|
|
- active_anon
|
|
|
|
- inactive_file
|
|
|
|
- active_file
|
|
|
|
- unevictable
|
|
|
|
-
|
|
|
|
|
|
+ inactive_anon, active_anon, inactive_file, active_file, unevictable
|
|
Amount of memory, swap-backed and filesystem-backed,
|
|
Amount of memory, swap-backed and filesystem-backed,
|
|
on the internal memory management lists used by the
|
|
on the internal memory management lists used by the
|
|
page reclaim algorithm
|
|
page reclaim algorithm
|
|
|
|
|
|
slab_reclaimable
|
|
slab_reclaimable
|
|
-
|
|
|
|
Part of "slab" that might be reclaimed, such as
|
|
Part of "slab" that might be reclaimed, such as
|
|
dentries and inodes.
|
|
dentries and inodes.
|
|
|
|
|
|
slab_unreclaimable
|
|
slab_unreclaimable
|
|
-
|
|
|
|
Part of "slab" that cannot be reclaimed on memory
|
|
Part of "slab" that cannot be reclaimed on memory
|
|
pressure.
|
|
pressure.
|
|
|
|
|
|
pgfault
|
|
pgfault
|
|
-
|
|
|
|
Total number of page faults incurred
|
|
Total number of page faults incurred
|
|
|
|
|
|
pgmajfault
|
|
pgmajfault
|
|
-
|
|
|
|
Number of major page faults incurred
|
|
Number of major page faults incurred
|
|
|
|
|
|
workingset_refault
|
|
workingset_refault
|
|
@@ -997,7 +996,6 @@ PAGE_SIZE multiple when read back.
|
|
Amount of reclaimed lazyfree pages
|
|
Amount of reclaimed lazyfree pages
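Since memory.stat entries are not guaranteed to sit at a fixed position, a reader should select values by key rather than by line number. The following is a minimal sketch of such a lookup; the helper name ``flat_key_value`` and the sample values are made up for illustration and are not part of the cgroup interface.

```shell
# flat_key_value KEY: read a flat-keyed file ("KEY VALUE" per line) from
# stdin and print the value stored under KEY; print nothing if absent.
# The function name is hypothetical.
flat_key_value() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Made-up sample data standing in for the contents of memory.stat.
sample='anon 1048576
file 4194304
kernel_stack 16384'

# Look up the "file" entry by key, regardless of its line position.
printf '%s\n' "$sample" | flat_key_value file
```

On a real system the input would presumably be redirected from the memory.stat file of the cgroup of interest instead of the sample string.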
|
|
|
|
|
|
memory.swap.current
|
|
memory.swap.current
|
|
-
|
|
|
|
A read-only single value file which exists on non-root
|
|
A read-only single value file which exists on non-root
|
|
cgroups.
|
|
cgroups.
|
|
|
|
|
|
@@ -1005,7 +1003,6 @@ PAGE_SIZE multiple when read back.
|
|
and its descendants.
|
|
and its descendants.
|
|
|
|
|
|
memory.swap.max
|
|
memory.swap.max
|
|
-
|
|
|
|
A read-write single value file which exists on non-root
|
|
A read-write single value file which exists on non-root
|
|
cgroups. The default is "max".
|
|
cgroups. The default is "max".
|
|
|
|
|
|
@@ -1013,7 +1010,8 @@ PAGE_SIZE multiple when read back.
|
|
limit, anonymous memory of the cgroup will not be swapped out.
|
|
limit, anonymous memory of the cgroup will not be swapped out.
|
|
|
|
|
|
|
|
|
|
-5-2-2. Usage Guidelines
|
|
|
|
|
|
+Usage Guidelines
|
|
|
|
+~~~~~~~~~~~~~~~~
|
|
|
|
|
|
"memory.high" is the main mechanism to control memory usage.
|
|
"memory.high" is the main mechanism to control memory usage.
|
|
Over-committing on high limit (sum of high limits > available memory)
|
|
Over-committing on high limit (sum of high limits > available memory)
|
|
@@ -1036,7 +1034,8 @@ memory; unfortunately, memory pressure monitoring mechanism isn't
|
|
implemented yet.
|
|
implemented yet.
|
|
|
|
|
|
|
|
|
|
-5-2-3. Memory Ownership
|
|
|
|
|
|
+Memory Ownership
|
|
|
|
+~~~~~~~~~~~~~~~~
|
|
|
|
|
|
A memory area is charged to the cgroup which instantiated it and stays
|
|
A memory area is charged to the cgroup which instantiated it and stays
|
|
charged to the cgroup until the area is released. Migrating a process
|
|
charged to the cgroup until the area is released. Migrating a process
|
|
@@ -1054,7 +1053,8 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
|
|
belonging to the affected files to ensure correct memory ownership.
|
|
belonging to the affected files to ensure correct memory ownership.
|
|
|
|
|
|
|
|
|
|
-5-3. IO
|
|
|
|
|
|
+IO
|
|
|
|
+--
|
|
|
|
|
|
The "io" controller regulates the distribution of IO resources. This
|
|
The "io" controller regulates the distribution of IO resources. This
|
|
controller implements both weight based and absolute bandwidth or IOPS
|
|
controller implements both weight based and absolute bandwidth or IOPS
|
|
@@ -1063,28 +1063,29 @@ only if cfq-iosched is in use and neither scheme is available for
|
|
blk-mq devices.
|
|
blk-mq devices.
|
|
|
|
|
|
|
|
|
|
-5-3-1. IO Interface Files
|
|
|
|
|
|
+IO Interface Files
|
|
|
|
+~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
io.stat
|
|
io.stat
|
|
-
|
|
|
|
A read-only nested-keyed file which exists on non-root
|
|
A read-only nested-keyed file which exists on non-root
|
|
cgroups.
|
|
cgroups.
|
|
|
|
|
|
Lines are keyed by $MAJ:$MIN device numbers and not ordered.
|
|
Lines are keyed by $MAJ:$MIN device numbers and not ordered.
|
|
The following nested keys are defined.
|
|
The following nested keys are defined.
|
|
|
|
|
|
|
|
+ ====== ===================
|
|
rbytes Bytes read
|
|
rbytes Bytes read
|
|
wbytes Bytes written
|
|
wbytes Bytes written
|
|
rios Number of read IOs
|
|
rios Number of read IOs
|
|
wios Number of write IOs
|
|
wios Number of write IOs
|
|
|
|
+ ====== ===================
|
|
|
|
|
|
- An example read output follows.
|
|
|
|
|
|
+ An example read output follows::
|
|
|
|
|
|
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
|
|
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
|
|
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
|
|
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
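The nested-keyed layout shown in the io.stat example can be read by first selecting the line for a device and then picking a key from its ``key=value`` pairs. The sketch below assumes only that layout; the helper name ``nested_key_value`` is hypothetical, and the input is the example output above rather than a live file.

```shell
# nested_key_value ID KEY: read a nested-keyed file from stdin and print
# the value of KEY on the line whose first field equals ID.
# The function name is hypothetical.
nested_key_value() {
    awk -v id="$1" -v key="$2" '
        $1 == id {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; exit }
            }
        }'
}

# Feed the documented example output and ask for the read IO count of 8:0.
printf '%s\n' \
    '8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353' \
    '8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252' |
    nested_key_value 8:0 rios
```

Because lines are not ordered, the lookup matches on the device number instead of relying on line position.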
|
|
|
|
|
|
io.weight
|
|
io.weight
|
|
-
|
|
|
|
A read-write flat-keyed file which exists on non-root cgroups.
|
|
A read-write flat-keyed file which exists on non-root cgroups.
|
|
The default is "default 100".
|
|
The default is "default 100".
|
|
|
|
|
|
@@ -1098,14 +1099,13 @@ blk-mq devices.
|
|
$WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
|
|
$WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
|
|
"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
|
|
"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
|
|
|
|
|
|
- An example read output follows.
|
|
|
|
|
|
+ An example read output follows::
|
|
|
|
|
|
default 100
|
|
default 100
|
|
8:16 200
|
|
8:16 200
|
|
8:0 50
|
|
8:0 50
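Given output like the io.weight example, the weight in effect for a device can be worked out by preferring a per-device override and otherwise falling back to the ``default`` line. This is a sketch under that assumption; ``io_weight_for`` is a hypothetical helper name, and the input is the example output rather than a live file.

```shell
# io_weight_for MAJ:MIN: read io.weight-style lines from stdin and print
# the override for the device if one exists, else the "default" weight.
# The function name is hypothetical.
io_weight_for() {
    awk -v dev="$1" '
        $1 == "default" { def = $2 }
        $1 == dev       { w = $2; found = 1 }
        END             { print (found ? w : def) }'
}

# 8:16 has an override in the example; 8:32 does not.
printf 'default 100\n8:16 200\n8:0 50\n' | io_weight_for 8:16
printf 'default 100\n8:16 200\n8:0 50\n' | io_weight_for 8:32
```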
|
|
|
|
|
|
io.max
|
|
io.max
|
|
-
|
|
|
|
A read-write nested-keyed file which exists on non-root
|
|
A read-write nested-keyed file which exists on non-root
|
|
cgroups.
|
|
cgroups.
|
|
|
|
|
|
@@ -1113,10 +1113,12 @@ blk-mq devices.
|
|
device numbers and not ordered. The following nested keys are
|
|
device numbers and not ordered. The following nested keys are
|
|
defined.
|
|
defined.
|
|
|
|
|
|
|
|
+ ===== ==================================
|
|
rbps Max read bytes per second
|
|
rbps Max read bytes per second
|
|
wbps Max write bytes per second
|
|
wbps Max write bytes per second
|
|
riops Max read IO operations per second
|
|
riops Max read IO operations per second
|
|
wiops Max write IO operations per second
|
|
wiops Max write IO operations per second
|
|
|
|
+ ===== ==================================
|
|
|
|
|
|
When writing, any number of nested key-value pairs can be
|
|
When writing, any number of nested key-value pairs can be
|
|
specified in any order. "max" can be specified as the value
|
|
specified in any order. "max" can be specified as the value
|
|
@@ -1126,24 +1128,25 @@ blk-mq devices.
|
|
BPS and IOPS are measured in each IO direction and IOs are
|
|
BPS and IOPS are measured in each IO direction and IOs are
|
|
delayed if limit is reached. Temporary bursts are allowed.
|
|
delayed if limit is reached. Temporary bursts are allowed.
|
|
|
|
|
|
- Setting read limit at 2M BPS and write at 120 IOPS for 8:16.
|
|
|
|
|
|
+ Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
|
|
|
|
|
|
echo "8:16 rbps=2097152 wiops=120" > io.max
|
|
echo "8:16 rbps=2097152 wiops=120" > io.max
|
|
|
|
|
|
- Reading returns the following.
|
|
|
|
|
|
+ Reading returns the following::
|
|
|
|
|
|
8:16 rbps=2097152 wbps=max riops=max wiops=120
|
|
8:16 rbps=2097152 wbps=max riops=max wiops=120
|
|
|
|
|
|
- Write IOPS limit can be removed by writing the following.
|
|
|
|
|
|
+ Write IOPS limit can be removed by writing the following::
|
|
|
|
|
|
echo "8:16 wiops=max" > io.max
|
|
echo "8:16 wiops=max" > io.max
|
|
|
|
|
|
- Reading now returns the following.
|
|
|
|
|
|
+ Reading now returns the following::
|
|
|
|
|
|
8:16 rbps=2097152 wbps=max riops=max wiops=max
|
|
8:16 rbps=2097152 wbps=max riops=max wiops=max
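The value 2097152 in the io.max example is 2M expressed in bytes (2 * 1024 * 1024). A small sketch of building the same write string follows; ``mib_to_bytes`` is a hypothetical helper, and the line is only printed here, not written to io.max.

```shell
# mib_to_bytes N: print N MiB expressed in bytes.
# The function name is hypothetical.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Build the line for a 2M BPS read limit and a 120 IOPS write limit on 8:16.
line="8:16 rbps=$(mib_to_bytes 2) wiops=120"

# Print the line instead of writing it; on a real system it could be
# applied with: echo "$line" > io.max
echo "$line"
```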
|
|
|
|
|
|
|
|
|
|
-5-3-2. Writeback
|
|
|
|
|
|
+Writeback
|
|
|
|
+~~~~~~~~~
|
|
|
|
|
|
Page cache is dirtied through buffered writes and shared mmaps and
|
|
Page cache is dirtied through buffered writes and shared mmaps and
|
|
written asynchronously to the backing filesystem by the writeback
|
|
written asynchronously to the backing filesystem by the writeback
|
|
@@ -1191,22 +1194,19 @@ patterns.
|
|
The sysctl knobs which affect writeback behavior are applied to cgroup
|
|
The sysctl knobs which affect writeback behavior are applied to cgroup
|
|
writeback as follows.
|
|
writeback as follows.
|
|
|
|
|
|
- vm.dirty_background_ratio
|
|
|
|
- vm.dirty_ratio
|
|
|
|
-
|
|
|
|
|
|
+ vm.dirty_background_ratio, vm.dirty_ratio
|
|
These ratios apply the same to cgroup writeback with the
|
|
These ratios apply the same to cgroup writeback with the
|
|
amount of available memory capped by limits imposed by the
|
|
amount of available memory capped by limits imposed by the
|
|
memory controller and system-wide clean memory.
|
|
memory controller and system-wide clean memory.
|
|
|
|
|
|
- vm.dirty_background_bytes
|
|
|
|
- vm.dirty_bytes
|
|
|
|
-
|
|
|
|
|
|
+ vm.dirty_background_bytes, vm.dirty_bytes
|
|
For cgroup writeback, this is calculated into ratio against
|
|
For cgroup writeback, this is calculated into ratio against
|
|
total available memory and applied the same way as
|
|
total available memory and applied the same way as
|
|
vm.dirty[_background]_ratio.
|
|
vm.dirty[_background]_ratio.
|
|
|
|
|
|
|
|
|
|
-5-4. PID
|
|
|
|
|
|
+PID
|
|
|
|
+---
|
|
|
|
|
|
The process number controller is used to allow a cgroup to stop any
|
|
The process number controller is used to allow a cgroup to stop any
|
|
new tasks from being fork()'d or clone()'d after a specified limit is
|
|
new tasks from being fork()'d or clone()'d after a specified limit is
|
|
@@ -1221,17 +1221,16 @@ Note that PIDs used in this controller refer to TIDs, process IDs as
|
|
used by the kernel.
|
|
used by the kernel.
|
|
|
|
|
|
|
|
|
|
-5-4-1. PID Interface Files
|
|
|
|
|
|
+PID Interface Files
|
|
|
|
+~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
pids.max
|
|
pids.max
|
|
-
|
|
|
|
A read-write single value file which exists on non-root
|
|
A read-write single value file which exists on non-root
|
|
cgroups. The default is "max".
|
|
cgroups. The default is "max".
|
|
|
|
|
|
Hard limit of number of processes.
|
|
Hard limit of number of processes.
|
|
|
|
|
|
pids.current
|
|
pids.current
|
|
-
|
|
|
|
A read-only single value file which exists on all cgroups.
|
|
A read-only single value file which exists on all cgroups.
|
|
|
|
|
|
The number of processes currently in the cgroup and its
|
|
The number of processes currently in the cgroup and its
|
|
@@ -1246,12 +1245,14 @@ through fork() or clone(). These will return -EAGAIN if the creation
|
|
of a new process would cause a cgroup policy to be violated.
|
|
of a new process would cause a cgroup policy to be violated.
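To illustrate how pids.max and pids.current relate, the sketch below estimates how many more processes a single cgroup could create before fork() or clone() would start failing. ``pids_headroom`` is a hypothetical helper that takes the two file contents as arguments; it ignores limits set on ancestor cgroups, which may be lower.

```shell
# pids_headroom MAX CURRENT: print how many more processes fit under the
# limit, "unlimited" if MAX is the literal string "max", or 0 if the
# count already meets or exceeds the limit.
# The function name is hypothetical.
pids_headroom() {
    max=$1
    cur=$2
    if [ "$max" = "max" ]; then
        echo unlimited
    elif [ "$cur" -ge "$max" ]; then
        echo 0
    else
        echo $(( max - cur ))
    fi
}

# Made-up values standing in for the contents of pids.max and pids.current.
pids_headroom 512 500
pids_headroom max 500
```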
|
|
|
|
|
|
|
|
|
|
-5-5. RDMA
|
|
|
|
|
|
+RDMA
|
|
|
|
+----
|
|
|
|
|
|
The "rdma" controller regulates the distribution and accounting of
|
|
The "rdma" controller regulates the distribution and accounting of
|
|
RDMA resources.
|
|
RDMA resources.
|
|
|
|
|
|
-5-5-1. RDMA Interface Files
|
|
|
|
|
|
+RDMA Interface Files
|
|
|
|
+~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
rdma.max
|
|
rdma.max
|
|
A read-write nested-keyed file that exists for all the cgroups
|
|
A read-write nested-keyed file that exists for all the cgroups
|
|
@@ -1264,10 +1265,12 @@ of RDMA resources.
|
|
|
|
|
|
The following nested keys are defined.
|
|
The following nested keys are defined.
|
|
|
|
|
|
|
|
+ ========== =============================
|
|
hca_handle Maximum number of HCA Handles
|
|
hca_handle Maximum number of HCA Handles
|
|
hca_object Maximum number of HCA Objects
|
|
hca_object Maximum number of HCA Objects
|
|
|
|
+ ========== =============================
|
|
|
|
|
|
- An example for mlx4 and ocrdma device follows.
|
|
|
|
|
|
+ An example for mlx4 and ocrdma device follows::
|
|
|
|
|
|
mlx4_0 hca_handle=2 hca_object=2000
|
|
mlx4_0 hca_handle=2 hca_object=2000
|
|
ocrdma1 hca_handle=3 hca_object=max
|
|
ocrdma1 hca_handle=3 hca_object=max
|
|
@@ -1276,15 +1279,17 @@ of RDMA resources.
|
|
A read-only file that describes current resource usage.
|
|
A read-only file that describes current resource usage.
|
|
It exists for all cgroups except root.
|
|
It exists for all cgroups except root.
|
|
|
|
|
|
- An example for mlx4 and ocrdma device follows.
|
|
|
|
|
|
+ An example for mlx4 and ocrdma device follows::
|
|
|
|
|
|
mlx4_0 hca_handle=1 hca_object=20
|
|
mlx4_0 hca_handle=1 hca_object=20
|
|
ocrdma1 hca_handle=1 hca_object=23
|
|
ocrdma1 hca_handle=1 hca_object=23
|
|
|
|
|
|
|
|
|
|
-5-6. Misc
|
|
|
|
|
|
+Misc
|
|
|
|
+----
|
|
|
|
|
|
-5-6-1. perf_event
|
|
|
|
|
|
+perf_event
|
|
|
|
+~~~~~~~~~~
|
|
|
|
|
|
perf_event controller, if not mounted on a legacy hierarchy, is
|
|
perf_event controller, if not mounted on a legacy hierarchy, is
|
|
automatically enabled on the v2 hierarchy so that perf events can
|
|
automatically enabled on the v2 hierarchy so that perf events can
|
|
@@ -1292,9 +1297,11 @@ always be filtered by cgroup v2 path. The controller can still be
|
|
moved to a legacy hierarchy after v2 hierarchy is populated.
|
|
moved to a legacy hierarchy after v2 hierarchy is populated.
|
|
|
|
|
|
|
|
|
|
-6. Namespace
|
|
|
|
|
|
+Namespace
|
|
|
|
+=========
|
|
|
|
|
|
-6-1. Basics
|
|
|
|
|
|
+Basics
|
|
|
|
+------
|
|
|
|
|
|
cgroup namespace provides a mechanism to virtualize the view of the
|
|
cgroup namespace provides a mechanism to virtualize the view of the
|
|
"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
|
|
"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
|
|
@@ -1308,7 +1315,7 @@ Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
|
|
complete path of the cgroup of a process. In a container setup where
|
|
complete path of the cgroup of a process. In a container setup where
|
|
a set of cgroups and namespaces are intended to isolate processes the
|
|
a set of cgroups and namespaces are intended to isolate processes the
|
|
"/proc/$PID/cgroup" file may leak potential system level information
|
|
"/proc/$PID/cgroup" file may leak potential system level information
|
|
-to the isolated processes. For Example:
|
|
|
|
|
|
+to the isolated processes. For Example::
|
|
|
|
|
|
# cat /proc/self/cgroup
|
|
# cat /proc/self/cgroup
|
|
0::/batchjobs/container_id1
|
|
0::/batchjobs/container_id1
|
|
@@ -1316,14 +1323,14 @@ to the isolated processes. For Example:
|
|
The path '/batchjobs/container_id1' can be considered as system-data
|
|
The path '/batchjobs/container_id1' can be considered as system-data
|
|
and undesirable to expose to the isolated processes. cgroup namespace
|
|
and undesirable to expose to the isolated processes. cgroup namespace
|
|
can be used to restrict visibility of this path. For example, before
|
|
can be used to restrict visibility of this path. For example, before
|
|
-creating a cgroup namespace, one would see:
|
|
|
|
|
|
+creating a cgroup namespace, one would see::
|
|
|
|
|
|
# ls -l /proc/self/ns/cgroup
|
|
# ls -l /proc/self/ns/cgroup
|
|
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
|
|
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
|
|
# cat /proc/self/cgroup
|
|
# cat /proc/self/cgroup
|
|
0::/batchjobs/container_id1
|
|
0::/batchjobs/container_id1
|
|
|
|
|
|
-After unsharing a new namespace, the view changes.
|
|
|
|
|
|
+After unsharing a new namespace, the view changes::
|
|
|
|
|
|
# ls -l /proc/self/ns/cgroup
|
|
# ls -l /proc/self/ns/cgroup
|
|
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
|
|
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
|
|
@@ -1341,7 +1348,8 @@ namespace is destroyed. The cgroupns root and the actual cgroups
|
|
remain.
|
|
remain.
|
|
|
|
|
|
|
|
|
|
-6-2. The Root and Views
|
|
|
|
|
|
+The Root and Views
|
|
|
|
+------------------
|
|
|
|
|
|
The 'cgroupns root' for a cgroup namespace is the cgroup in which the
|
|
The 'cgroupns root' for a cgroup namespace is the cgroup in which the
|
|
process calling unshare(2) is running. For example, if a process in
|
|
process calling unshare(2) is running. For example, if a process in
|
|
@@ -1350,7 +1358,7 @@ process calling unshare(2) is running. For example, if a process in
|
|
init_cgroup_ns, this is the real root ('/') cgroup.
|
|
init_cgroup_ns, this is the real root ('/') cgroup.
|
|
|
|
|
|
The cgroupns root cgroup does not change even if the namespace creator
|
|
The cgroupns root cgroup does not change even if the namespace creator
|
|
-process later moves to a different cgroup.
|
|
|
|
|
|
+process later moves to a different cgroup::
|
|
|
|
|
|
# ~/unshare -c # unshare cgroupns in some cgroup
|
|
# ~/unshare -c # unshare cgroupns in some cgroup
|
|
# cat /proc/self/cgroup
|
|
# cat /proc/self/cgroup
|
|
@@ -1364,7 +1372,7 @@ Each process gets its namespace-specific view of "/proc/$PID/cgroup"
|
|
|
|
|
|
Processes running inside the cgroup namespace will be able to see
|
|
Processes running inside the cgroup namespace will be able to see
|
|
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
|
|
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
|
|
-From within an unshared cgroupns:
|
|
|
|
|
|
+From within an unshared cgroupns::
|
|
|
|
|
|
# sleep 100000 &
|
|
# sleep 100000 &
|
|
[1] 7353
|
|
[1] 7353
|
|
@@ -1373,7 +1381,7 @@ From within an unshared cgroupns:
|
|
0::/sub_cgrp_1
|
|
0::/sub_cgrp_1
|
|
|
|
|
|
From the initial cgroup namespace, the real cgroup path will be
|
|
From the initial cgroup namespace, the real cgroup path will be
|
|
-visible:
|
|
|
|
|
|
+visible::
|
|
|
|
|
|
$ cat /proc/7353/cgroup
|
|
$ cat /proc/7353/cgroup
|
|
0::/batchjobs/container_id1/sub_cgrp_1
|
|
0::/batchjobs/container_id1/sub_cgrp_1
|
|
@@ -1381,7 +1389,7 @@ visible:
|
|
From a sibling cgroup namespace (that is, a namespace rooted at a
|
|
From a sibling cgroup namespace (that is, a namespace rooted at a
|
|
different cgroup), the cgroup path relative to its own cgroup
|
|
different cgroup), the cgroup path relative to its own cgroup
|
|
namespace root will be shown. For instance, if PID 7353's cgroup
|
|
namespace root will be shown. For instance, if PID 7353's cgroup
|
|
-namespace root is at '/batchjobs/container_id2', then it will see
|
|
|
|
|
|
+namespace root is at '/batchjobs/container_id2', then it will see::
|
|
|
|
|
|
# cat /proc/7353/cgroup
|
|
# cat /proc/7353/cgroup
|
|
0::/../container_id2/sub_cgrp_1
|
|
0::/../container_id2/sub_cgrp_1
|
|
@@ -1390,13 +1398,14 @@ Note that the relative path always starts with '/' to indicate that
|
|
it's relative to the cgroup namespace root of the caller.
|
|
it's relative to the cgroup namespace root of the caller.
|
|
|
|
|
|
|
|
|
|
-6-3. Migration and setns(2)
|
|
|
|
|
|
+Migration and setns(2)
|
|
|
|
+----------------------
|
|
|
|
|
|
Processes inside a cgroup namespace can move into and out of the
|
|
Processes inside a cgroup namespace can move into and out of the
|
|
namespace root if they have proper access to external cgroups. For
|
|
namespace root if they have proper access to external cgroups. For
|
|
example, from inside a namespace with cgroupns root at
|
|
example, from inside a namespace with cgroupns root at
|
|
/batchjobs/container_id1, and assuming that the global hierarchy is
|
|
/batchjobs/container_id1, and assuming that the global hierarchy is
|
|
-still accessible inside cgroupns:
|
|
|
|
|
|
+still accessible inside cgroupns::
|
|
|
|
|
|
# cat /proc/7353/cgroup
|
|
# cat /proc/7353/cgroup
|
|
0::/sub_cgrp_1
|
|
0::/sub_cgrp_1
|
|
@@ -1418,10 +1427,11 @@ namespace. It is expected that the someone moves the attaching
|
|
process under the target cgroup namespace root.
|
|
process under the target cgroup namespace root.
|
|
|
|
|
|
|
|
|
|
-6-4. Interaction with Other Namespaces
|
|
|
|
|
|
+Interaction with Other Namespaces
|
|
|
|
+---------------------------------
|
|
|
|
|
|
Namespace specific cgroup hierarchy can be mounted by a process
|
|
Namespace specific cgroup hierarchy can be mounted by a process
|
|
-running inside a non-init cgroup namespace.
|
|
|
|
|
|
+running inside a non-init cgroup namespace::
|
|
|
|
|
|
# mount -t cgroup2 none $MOUNT_POINT
|
|
# mount -t cgroup2 none $MOUNT_POINT
|
|
|
|
|
|
@@ -1434,27 +1444,27 @@ the view of cgroup hierarchy by namespace-private cgroupfs mount
|
|
provides a properly isolated cgroup view inside the container.
|
|
provides a properly isolated cgroup view inside the container.
|
|
|
|
|
|
|
|
|
|
-P. Information on Kernel Programming
|
|
|
|
|
|
+Information on Kernel Programming
|
|
|
|
+=================================
|
|
|
|
|
|
This section contains kernel programming information in the areas
|
|
This section contains kernel programming information in the areas
|
|
where interacting with cgroup is necessary. cgroup core and
|
|
where interacting with cgroup is necessary. cgroup core and
|
|
controllers are not covered.
|
|
controllers are not covered.
|
|
|
|
|
|
|
|
|
|
-P-1. Filesystem Support for Writeback
|
|
|
|
|
|
+Filesystem Support for Writeback
|
|
|
|
+--------------------------------
|
|
|
|
|
|
A filesystem can support cgroup writeback by updating
|
|
A filesystem can support cgroup writeback by updating
|
|
address_space_operations->writepage[s]() to annotate bio's using the
|
|
address_space_operations->writepage[s]() to annotate bio's using the
|
|
following two functions.
|
|
following two functions.
|
|
|
|
|
|
wbc_init_bio(@wbc, @bio)
|
|
wbc_init_bio(@wbc, @bio)
|
|
-
|
|
|
|
Should be called for each bio carrying writeback data and
|
|
Should be called for each bio carrying writeback data and
|
|
associates the bio with the inode's owner cgroup. Can be
|
|
associates the bio with the inode's owner cgroup. Can be
|
|
called anytime between bio allocation and submission.
|
|
called anytime between bio allocation and submission.
|
|
|
|
|
|
wbc_account_io(@wbc, @page, @bytes)
|
|
wbc_account_io(@wbc, @page, @bytes)
|
|
-
|
|
|
|
Should be called for each data segment being written out.
|
|
Should be called for each data segment being written out.
|
|
While this function doesn't care exactly when it's called
|
|
While this function doesn't care exactly when it's called
|
|
during the writeback session, it's the easiest and most
|
|
during the writeback session, it's the easiest and most
|
|
@@ -1475,7 +1485,8 @@ cases by skipping wbc_init_bio() or using bio_associate_blkcg()
|
|
directly.
|
|
directly.
|
|
|
|
|
|
|
|
|
|
-D. Deprecated v1 Core Features
|
|
|
|
|
|
+Deprecated v1 Core Features
|
|
|
|
+===========================
|
|
|
|
|
|
- Multiple hierarchies including named ones are not supported.
|
|
- Multiple hierarchies including named ones are not supported.
|
|
|
|
|
|
@@ -1489,9 +1500,11 @@ D. Deprecated v1 Core Features
|
|
at the root instead.
|
|
at the root instead.
|
|
|
|
|
|
|
|
|
|
-R. Issues with v1 and Rationales for v2
|
|
|
|
|
|
+Issues with v1 and Rationales for v2
|
|
|
|
+====================================
|
|
|
|
|
|
-R-1. Multiple Hierarchies
|
|
|
|
|
|
+Multiple Hierarchies
|
|
|
|
+--------------------
|
|
|
|
|
|
cgroup v1 allowed an arbitrary number of hierarchies and each
|
|
cgroup v1 allowed an arbitrary number of hierarchies and each
|
|
hierarchy could host any number of controllers. While this seemed to
|
|
hierarchy could host any number of controllers. While this seemed to
|
|
@@ -1543,7 +1556,8 @@ how memory is distributed beyond a certain level while still wanting
|
|
to control how CPU cycles are distributed.
|
|
to control how CPU cycles are distributed.
|
|
|
|
|
|
|
|
|
|
-R-2. Thread Granularity
|
|
|
|
|
|
+Thread Granularity
|
|
|
|
+------------------
|
|
|
|
|
|
cgroup v1 allowed threads of a process to belong to different cgroups.
|
|
cgroup v1 allowed threads of a process to belong to different cgroups.
|
|
This didn't make sense for some controllers and those controllers
|
|
This didn't make sense for some controllers and those controllers
|
|
@@ -1586,7 +1600,8 @@ misbehaving and poorly abstracted interfaces and kernel exposing and
|
|
locked into constructs inadvertently.
|
|
locked into constructs inadvertently.
|
|
|
|
|
|
|
|
|
|
-R-3. Competition Between Inner Nodes and Threads
|
|
|
|
|
|
+Competition Between Inner Nodes and Threads
|
|
|
|
+-------------------------------------------
|
|
|
|
|
|
cgroup v1 allowed threads to be in any cgroups which created an
|
|
cgroup v1 allowed threads to be in any cgroups which created an
|
|
interesting problem where threads belonging to a parent cgroup and its
|
|
interesting problem where threads belonging to a parent cgroup and its
|
|
@@ -1605,7 +1620,7 @@ simply weren't available for threads.
|
|
|
|
|
|
The io controller implicitly created a hidden leaf node for each
|
|
The io controller implicitly created a hidden leaf node for each
|
|
cgroup to host the threads. The hidden leaf had its own copies of all
|
|
cgroup to host the threads. The hidden leaf had its own copies of all
|
|
-the knobs with "leaf_" prefixed. While this allowed equivalent
|
|
|
|
|
|
+the knobs with ``leaf_`` prefixed. While this allowed equivalent
|
|
control over internal threads, it was with serious drawbacks. It
|
|
control over internal threads, it was with serious drawbacks. It
|
|
always added an extra layer of nesting which wouldn't be necessary
|
|
always added an extra layer of nesting which wouldn't be necessary
|
|
otherwise, made the interface messy and significantly complicated the
|
|
otherwise, made the interface messy and significantly complicated the
|
|
@@ -1626,7 +1641,8 @@ This clearly is a problem which needs to be addressed from cgroup core
|
|
in a uniform way.
|
|
in a uniform way.
|
|
|
|
|
|
|
|
|
|
-R-4. Other Interface Issues
|
|
|
|
|
|
+Other Interface Issues
|
|
|
|
+----------------------
|
|
|
|
|
|
cgroup v1 grew without oversight and developed a large number of
|
|
cgroup v1 grew without oversight and developed a large number of
|
|
idiosyncrasies and inconsistencies. One issue on the cgroup core side
|
|
idiosyncrasies and inconsistencies. One issue on the cgroup core side
|
|
@@ -1654,9 +1670,11 @@ cgroup v2 establishes common conventions where appropriate and updates
|
|
controllers so that they expose minimal and consistent interfaces.
|
|
controllers so that they expose minimal and consistent interfaces.
|
|
|
|
|
|
|
|
|
|
-R-5. Controller Issues and Remedies
|
|
|
|
|
|
+Controller Issues and Remedies
|
|
|
|
+------------------------------
|
|
|
|
|
|
-R-5-1. Memory
|
|
|
|
|
|
+Memory
|
|
|
|
+~~~~~~
|
|
|
|
|
|
The original lower boundary, the soft limit, is defined as a limit
|
|
The original lower boundary, the soft limit, is defined as a limit
|
|
that is per default unset. As a result, the set of cgroups that
|
|
that is per default unset. As a result, the set of cgroups that
|