CGroups stands for Control Groups. Google engineers started developing them in 2006, and they were merged into the Linux kernel in version 2.6.24 (2008), to restrict the resources used by processes. Each kind of resource a process can use has its own resource controller, also called a CGroup subsystem.
Here is the list of the available resource controllers:
- blkio: sets limits on input/output access to and from block devices (see BlockIOWeight);
- cpu: uses the CPU scheduler to provide CGroup tasks with access to the CPU. It is mounted together with the cpuacct controller on the same mount (see CPUShares);
- cpuacct: creates automatic reports on CPU resources used by tasks in a CGroup. It is mounted together with the cpu controller on the same mount (see CPUShares);
- cpuset: assigns individual CPUs (on a multicore system) and memory nodes to tasks in a CGroup;
- devices: allows or denies access to devices for tasks in a CGroup;
- freezer: suspends or resumes tasks in a CGroup;
- memory: sets limits on memory use by tasks in a CGroup, and generates automatic reports on memory resources used by those tasks (see MemoryLimit);
- net_cls: tags network packets with a class identifier (classid) that allows the Linux traffic controller (the tc command) to identify packets originating from a particular CGroup task;
- perf_event: enables monitoring CGroups with the perf tool;
- hugetlb: allows the use of large virtual memory pages and enforces resource limits on these pages.
CGroups were already available in RHEL 6. However, with the arrival of Systemd in RHEL 7, many things have changed.
Systemd organizes processes in control groups. For example, all the processes started by an Apache web server will be in the same control group, CGI scripts included. This makes stopping an Apache web server much easier. It also moves the resource management settings from the process level to the application level by binding the system of CGroup hierarchies to the Systemd unit tree.
The Systemd unit tree is made up of several parts:
- at the top, there is the root slice called -.slice,
- below, there are the system.slice (the default place for all system services), the user.slice (the default place for all user sessions) and the machine.slice (the default place for all virtual machines and Linux containers),
- further down, there are scopes (groups of externally created processes started via fork) and services (groups of processes created through a unit file).
Note: If only system services run on a server, they get 100% of the available resources. If users connect to the server, they will get 100% of the available resources minus what the system services use. If both users and system services request 100% of the resources each, they will only get 50%. If there are system services, users and virtual machines and they all request 100% of the resources, they will only get 33% each.
Any kind of slice can get 100% of the resources if nobody else wants them. Limits only apply when slices compete for more resources than are available.
For example, to get the full hierarchy of control groups, type:
# systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
├─user.slice
│ └─user-0.slice
│   ├─session-56.scope
│   │ ├─19679 sshd: root@pts/1
│   │ ├─19683 -bash
│   │ ├─19714 systemd-cgls
│   │ └─19715 less
│   └─session-40.scope
│     ├─19370 sshd: root@pts/0
│     └─19374 -bash
└─system.slice
  ├─httpd.service
  │ ├─2577 /usr/sbin/httpd -DFOREGROUND
  │ ├─2578 /usr/sbin/httpd -DFOREGROUND
  │ └─2579 /usr/sbin/httpd -DFOREGROUND
  ├─polkit.service
  │ └─730 /usr/lib/polkit-1/polkitd --no-debug
  ├─systemd-udevd.service
  │ └─455 /usr/lib/systemd/systemd-udevd
  ├─lvm2-lvmetad.service
  │ └─450 /usr/sbin/lvmetad -f
  ├─systemd-journald.service
  │ └─449 /usr/lib/systemd/systemd-journald
  ├─dbus.service
  │ └─611 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --sy
  ├─systemd-logind.service
  │ └─604 /usr/lib/systemd/systemd-logind
  ├─chronyd.service
  │ └─613 /usr/sbin/chronyd -u chrony
  ├─crond.service
  │ └─621 /usr/sbin/crond -n
  ├─postfix.service
  │ ├─ 1349 /usr/libexec/postfix/master -w
  │ ├─ 1358 qmgr -l -t unix -u
  │ └─19596 pickup -l -t unix -u
  ├─rsyslog.service
  │ └─589 /usr/sbin/rsyslogd -n
  ├─sshd.service
  │ └─1068 /usr/sbin/sshd -D
  ├─tuned.service
  │ └─583 /usr/bin/python -Es /usr/sbin/tuned -l -P
  ├─firewalld.service
  │ └─580 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
  ├─NetworkManager.service
  │ └─698 /usr/sbin/NetworkManager --no-daemon
  ├─system-getty.slice
  │ └─getty@tty1.service
  │   └─631 /sbin/agetty --noclear tty1
  └─system-serial\x2dgetty.slice
    └─serial-getty@ttyS0.service
      └─630 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0
To kill all the processes associated with an Apache server (CGI scripts included), type:
# systemctl kill httpd
Note: Add the -s option to specify the signal to send (SIGTERM, SIGINT or SIGSTOP; by default SIGTERM).
To get the list of control groups ordered by CPU, memory and disk I/O load, type:
# systemd-cgtop
Path                                  Tasks   %CPU   Memory  Input/s Output/s
/                                       213    3.9   829.7M        -        -
/system.slice                             1      -        -        -        -
/system.slice/sshd.service                1      -        -        -        -
/system.slice/ModemManager.service        1      -        -        -        -
If you want the sshd service to display the amount of memory currently used through the systemd-cgtop command, you need to activate accounting by creating the /etc/systemd/system/sshd.service.d directory:
# mkdir /etc/systemd/system/sshd.service.d
Then, you need to create the /etc/systemd/system/sshd.service.d/accounting.conf file and paste the following lines into it:
[Service]
MemoryAccounting=true
Note: Other options exist like CPUAccounting, BlockIOAccounting or TasksAccounting (since RHEL 7.4).
Finally, you need to type:
# systemctl daemon-reload; systemctl restart sshd
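The three steps above can be collected into a small shell sketch. The function name enable_memory_accounting and its optional second argument (an alternative configuration root) are hypothetical and only there so the sketch can be tried against a scratch directory before touching /etc/systemd/system.

```shell
# Hypothetical helper: enable memory accounting for a unit through a drop-in.
# $1 = unit name, $2 = configuration root (assumed default: /etc/systemd/system)
enable_memory_accounting() {
    dir="${2:-/etc/systemd/system}/$1.d"
    mkdir -p "$dir" &&
    printf '[Service]\nMemoryAccounting=true\n' > "$dir/accounting.conf"
}

# Try it against a scratch directory first, then inspect the result.
scratch=$(mktemp -d)
enable_memory_accounting sshd.service "$scratch"
cat "$scratch/sshd.service.d/accounting.conf"
```

On a real host, you would call it as root without the second argument and then run systemctl daemon-reload and systemctl restart sshd as shown above.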
Systemd Resource Controllers
Through Systemd, several resources can be restricted:
- CPUShares: by default at 1024; this requires CPUAccounting=true to be set first,
- MemoryLimit: by default without limit, value expressed in Megabytes or Gigabytes; this requires MemoryAccounting=true to be set first,
- BlockIOWeight, BlockIODeviceWeight: by default without limit, value between 10 and 1000 (requires the CFQ I/O elevator); this requires BlockIOAccounting=true to be set first,
- BlockIOReadBandwidth, BlockIOWriteBandwidth: by default without limit, value expressed in Megabytes or Gigabytes; this requires BlockIOAccounting=true to be set first.
The RHEL 7.2 release brings three new resource management options:
- StartupCPUShares and StartupBlockIOWeight: they work like CPUShares and BlockIOWeight but only apply during system startup; this requires CPUAccounting=true or BlockIOAccounting=true, respectively, to be set first,
- CPUQuota: it restricts CPU time to the specified percentage, even if the machine is otherwise idle; this requires CPUAccounting=true to be set first.
The RHEL 7.4 release brings one new resource management option:
- TasksMax=10 defines the maximum number of tasks the unit can create (here 10); this requires TasksAccounting=true to be set first.
To put resource limits on a service (here 500 CPUShares), type:
# systemctl set-property httpd CPUShares=500
# systemctl daemon-reload
Note1: The change is persistent: it is written to a drop-in file for the service under /etc/systemd/system. Use the --runtime option to make the change temporary, until the next reboot.
Note2: By default, each service owns 1024 CPUShares. Nothing prevents you from giving a smaller or bigger value.
To get the current CPUShares service value, type:
# systemctl show -p CPUShares httpd
CPUShares=500
# systemctl show httpd | grep CPUShares
CPUShares=500
Note: Each time a resource limit is set on a service, a directory of the same name with the .d suffix is created in /etc/systemd/system. For example, in the previous case, a directory named /etc/systemd/system/httpd.service.d is created with a file called 90-CPUShares.conf in it and the following content:
[Service]
CPUShares=500
Note: The newly created directory (here /etc/systemd/system/httpd.service.d) can also be used to customize the service configuration file.
Also, if you need to use RT (Real-Time) services, be ready to apply additional RT configurations.
The libcgroup-tools package gives access to some useful tools for manipulating control groups.
Some Other Examples
As previously seen, system processes get 1024 CPUShares, user processes 1024 CPUShares and virtual machines 1024 CPUShares, which means 33% of CPU each by default (you will only see it if each category executes some tasks).
To allocate 70% of CPU to the system processes, 20% of CPU to the user processes and 10% of CPU to the virtual machines, type:
# systemctl set-property system.slice CPUShares=7168
# systemctl set-property user.slice CPUShares=2048
# systemctl set-property machine.slice CPUShares=1024
Note: Rebooting might be necessary to see the changes.
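The CPUShares values above follow from the target percentages: only the ratio between siblings matters, so one convenient convention is to give the smallest percentage the default weight of 1024 and scale the others accordingly. The sketch below illustrates that arithmetic; the function name shares_for is hypothetical, and the scaling convention is an assumption, not a Systemd requirement.

```shell
# Hypothetical helper: scale target percentages to CPUShares values,
# giving the smallest percentage the default weight of 1024.
shares_for() {
    min=$1
    for p in "$@"; do
        if [ "$p" -lt "$min" ]; then
            min=$p
        fi
    done
    for p in "$@"; do
        echo $(( p * 1024 / min ))
    done
}

# system.slice, user.slice, machine.slice: 70%, 20%, 10%
shares_for 70 20 10
```

This prints 7168, 2048 and 1024, one value per line, matching the three set-property commands.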
To restrict the user with UID 1000 to at most 20% of one CPU, type:
# systemctl set-property user-1000.slice CPUQuota=20%
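Unlike CPUShares, CPUQuota is an absolute cap. On the legacy CGroup hierarchy it is enforced through the CFS bandwidth controller (the cpu.cfs_quota_us and cpu.cfs_period_us attributes). The sketch below shows the conversion, assuming the usual period of 100000 microseconds (100 ms); the function name quota_us is hypothetical.

```shell
# Assumption: the CFS bandwidth period is 100000 microseconds (100 ms).
PERIOD_US=100000

# Hypothetical helper: convert a CPUQuota percentage (without the % sign)
# into the run time allowed per period, in microseconds.
quota_us() {
    echo $(( $1 * PERIOD_US / 100 ))
}

quota_us 20     # 20% of one CPU
quota_us 200    # the equivalent of two full CPUs
```

With these assumptions, CPUQuota=20% lets the tasks of the slice run for 20000 microseconds out of every 100000, however idle the machine is.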
To reduce the memory available for the same user to 1GB, type:
# systemctl set-property user-1000.slice MemoryLimit=1024M
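Systemd parses the K, M and G suffixes with base 1024, and systemctl show reports the resulting limit in bytes. A small conversion sketch can help when comparing the two; the function name to_bytes is hypothetical.

```shell
# Hypothetical helper: convert a size with a K, M or G suffix (base 1024)
# into bytes; a bare number is returned unchanged.
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

to_bytes 1024M
to_bytes 1G
```

Both calls print 1073741824, so MemoryLimit=1024M and MemoryLimit=1G describe the same limit.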
To limit the mariadb service to writing at no more than 2MB/s to the /dev/vdb device, type:
# systemctl set-property mariadb.service BlockIOWriteBandwidth="/dev/vdb 2M"
To better understand CGroups, let’s take an example. You want to run a website but you’ve got only one server.
You plan to use the classical LAMP stack (Linux, here CentOS 7, Apache, MariaDB and PHP).
Your server has 4 gigabytes of memory and you want to allocate resources as follows:
- Apache service (here httpd.service): 40% of CPU, 500M of memory,
- PHP service (here php-fpm.service): 30% of CPU, 1G of memory,
- MariaDB service (here mariadb.service): 30% of CPU, 1G of memory.
This leaves about 1.5G of memory (4G minus the 2.5G allocated above) for the other processes (system, etc).
Note1: The values given are only for the sake of the discussion.
Note2: If you don’t configure CGroups, everything will work like in RHEL 6: all the processes will share the server power and the memory as they need.
As all your LAMP services are started from a Systemd unit file, they will be placed in the system.slice.
Here is the configuration to set up with the systemctl set-property command:
- Apache service: CPUShares=4096 (4 x 1024); MemoryLimit=500M,
- PHP service: CPUShares=3072 (3 x 1024); MemoryLimit=1G,
- MariaDB service: CPUShares=3072 (3 x 1024); MemoryLimit=1G.
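As a sketch, the configuration listed above translates into one systemctl set-property call per service (the command accepts several assignments at once). The function below only prints the commands so they can be reviewed first; the name lamp_commands is hypothetical.

```shell
# Hypothetical helper: print (rather than run) one set-property command
# per LAMP service, from a small table of unit, CPUShares and MemoryLimit.
lamp_commands() {
    while read -r unit shares mem; do
        echo "systemctl set-property $unit CPUShares=$shares MemoryLimit=$mem"
    done <<'EOF'
httpd.service 4096 500M
php-fpm.service 3072 1G
mariadb.service 3072 1G
EOF
}

lamp_commands
```

Once the output looks right, it can be applied as root, for example with lamp_commands | sh.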
Note1: Under full load, the Apache service will get 4096/(4096+3072+3072) = 40% of the CPU, the PHP service will get 3072/(4096+3072+3072) = 30%, etc.
Note2: There are some other services in the system.slice (crond, postfix, chronyd, etc). But, as they are not very hungry, they will not consume their default allocated CPU resources (1024) and will not change the situation. However, even if the Apache, MariaDB and PHP services use all their CPU resources, shares are proportional rather than absolute, so there will still be some resources left for the other services.
Caution: Once you set up a CPUShares CGroup restriction on one service in the system.slice, all the services there get CPUShares CGroup activated: even though you don’t specify anything, every newly started service will be restricted to 1024 CPUShares by default. It is not possible to CPU-restrict some services and leave the others without restriction. For a detailed explanation of the mechanism, see the All control groups belong to us! video below in the Additional Resources section.
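The ratios in the two notes above can be checked with a short sketch. It assumes that every listed service is busy at the same time and that nothing else in the slice competes for the CPU; the function name cpu_split is hypothetical.

```shell
# Hypothetical helper: given name=shares pairs for the busy services,
# print the CPU percentage each one gets relative to its siblings.
cpu_split() {
    printf '%s\n' "$@" | awk -F= '
        { name[NR] = $1; share[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", name[i], 100 * share[i] / total
        }'
}

# Only the three LAMP services are busy.
cpu_split httpd=4096 php-fpm=3072 mariadb=3072

# A fourth service with the default 1024 shares becomes busy as well.
cpu_split httpd=4096 php-fpm=3072 mariadb=3072 crond=1024
```

In the first case the split is 40%, 30% and 30%. In the second, the extra service takes 1024/11264 of the CPU (about 9%) and the other three shrink proportionally, which is why some CPU always remains available to the other services.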
On this topic you can also:
- listen to CGroups (7min/2014) record for the full CGroups history,
- watch All control groups belong to us! (55min/2013) video to get some explanations from Systemd’s creators,
- watch Georgios Magklaras’ demo (24min/2014),
- look at Andy Grimm’s Introduction to CGroups (60min/2014) for an explanation about children CPUShares computations,
- read Radoslaw Kujawa’s blog about CGroups: RHEL/CentOS 7 service resource management with cgroups, RHEL/CentOS 7 run-time and session resource management with cgroups,
- read this Red Hat article about Controlling resources with cgroups for performance testing,
- read the official documentation: RHEL 7 Resource Management Guide,
- watch Lennart Poettering’s Systemd.conf 2015 presentation (45min/2015),
- read this page about the policies followed by control groups,
- read Zhenyun Zhuang’s article about Addressing Memory-Related Performance Pitfalls of Cgroups,
- read Marc Richter’s series of articles about CGroups: CGroup basics, Turning knobs, Thanks for the memories, All the I/Os, and Hand rolling your own cgroup,
- watch Sander van Vugt’s presentation about Managing performance parameters through Systemd (33min/2017),
- watch this Facebook presentation at FOSDEM 2017 about the future of CGroups (39min/2017) (pdf).