
What are the implications of changing socket buffer sizes?

We will discuss the implications of changing the values of the following parameters:
/proc/sys/net/core/rmem_default = 524288
/proc/sys/net/core/rmem_max = 524288
/proc/sys/net/core/wmem_default = 524288
/proc/sys/net/core/wmem_max = 524288
  • Increasing rmem/wmem increases the buffer size allocated to every socket opened on the system. These values need to be tuned for your environment and requirements. A higher value may increase throughput to some extent, but it can also increase latency. You need to determine which matters more for your workload, and a suitable value can only be arrived at through repeated testing (see the example after this list).

  • When buffering is enabled, a received packet is not immediately processed by the receiving application. With a large buffer this delay increases, as the packet has to wait for the buffer backlog to be emptied before it gets its turn for processing.

  • Buffering is good for increasing throughput, because by keeping the buffer full the receiving application always has data to process. But this affects latency, as packets have to wait longer in the buffer before being processed. For more information on this, also see Bufferbloat: http://en.wikipedia.org/wiki/Bufferbloat
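
As an example (the values are illustrative and must be validated by testing in your environment), the current settings can be inspected and changed at run time with sysctl; to make a change persistent, add the corresponding line to /etc/sysctl.conf:

# sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 524288
net.core.rmem_max = 524288
net.core.wmem_default = 524288
net.core.wmem_max = 524288
# sysctl -w net.core.rmem_max=1048576        # example: double the receive buffer ceiling
# sysctl -w net.core.wmem_max=1048576        # example: double the send buffer ceiling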


kernel: Out of socket memory

The solution for this is to increase the TCP memory limits. This can be done by adding the following parameters to /etc/sysctl.conf:
net.core.wmem_max=12582912
net.core.rmem_max=12582912
net.ipv4.tcp_rmem= 10240 87380 12582912
net.ipv4.tcp_wmem= 10240 87380 12582912
These figures are just an example and need to be tuned on a per-system basis. Along similar lines, the tcp_max_orphans sysctl variable can be increased, but each orphan entry has a memory overhead of roughly 64 KB, so the value needs careful tuning.
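
As a minimal sketch using the example values above, the same settings can be applied at run time with sysctl and the persistent configuration reloaded:

# sysctl -w net.core.rmem_max=12582912
# sysctl -w net.core.wmem_max=12582912
# sysctl -w net.ipv4.tcp_rmem="10240 87380 12582912"
# sysctl -w net.ipv4.tcp_wmem="10240 87380 12582912"
# sysctl -p        # apply the settings stored in /etc/sysctl.conf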

  • For more information on tuning socket buffers refer to: How to tune the TCP Socket Buffers?

There are three factors which may cause the problem:
  1. The networking behavior of your system, for example, how many TCP sockets are created on your system.
  2. The amount of RAM in your system.
  3. The following two kernel parameters:
/proc/sys/net/ipv4/tcp_mem
/proc/sys/net/ipv4/tcp_max_orphans

An example on a system with 1 GB of RAM:
# cat /proc/sys/net/ipv4/tcp_max_orphans 
32768

# cat /proc/sys/net/ipv4/tcp_mem 
98304     131072     196608
The meaning of the two kernel parameters:
  1. tcp_max_orphans -- the maximal number of TCP sockets not attached to any user file handle, held by the system. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. The default value of this parameter on RHEL 5.2 is 32768.
  2. tcp_mem -- vector of 3 INTEGERs: min, pressure, max.
  • min: below this number of pages TCP is not bothered about its memory appetite.
  • pressure: when amount of memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption and enters memory pressure mode, which is exited when memory consumption falls under "min". The memory pressure mode presses down the TCP receive and send buffers for all the sockets as much as possible, until the low mark is reached again.  
  • max: number of pages allowed for queuing by all TCP sockets.

If the number of orphaned sockets exceeds the value of tcp_max_orphans, the "kernel: Out of socket memory" message may be triggered.

If the total number of memory pages assigned to all TCP sockets on the system exceeds the max value of tcp_mem, the message may also be triggered.

Either of the two situations above will trigger the "kernel: Out of socket memory" message.
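
A quick way to compare current usage against these two limits is to read /proc/net/sockstat (described in more detail later in this document) alongside the tunables:

# grep TCP /proc/net/sockstat                # "orphan" is the current orphan count, "mem" is pages in use
TCP: inuse 21 orphan 0 tw 0 alloc 28 mem 10
# cat /proc/sys/net/ipv4/tcp_max_orphans
32768
# cat /proc/sys/net/ipv4/tcp_mem
98304     131072     196608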


Logic behind killing processes during an Out of Memory situation


A simplified explanation of the OOM-killer logic follows.
A function called badness() is defined to calculate points for each process. Points are added to:

  • Processes with high memory usage
  • Niced processes

Badness points are subtracted from:

  • Processes which have been running for a long time
  • Processes which were started by superusers
  • Process with direct hardware access

The process with the highest number of badness points will be killed, unless it is already in the midst of freeing up memory on its own. (Note that if a process has 0 points, it cannot be killed.)

The kernel will wait for some time to see if enough memory is freed by killing one process. If enough memory is not freed, the OOM-kills will continue until enough memory is freed or until there are no candidate processes left to kill. If the kernel is out of memory and is unable to find a candidate process to kill, it panics with a message like "Kernel panic - not syncing: Out of memory and no killable processes...".
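
The badness score the kernel currently assigns to a process can be inspected through /proc, and a process can be exempted from the OOM-killer with the special adjustment value -17 (the PID and score below are just examples):

# cat /proc/1234/oom_score                   # current badness score for PID 1234
42
# echo -17 > /proc/1234/oom_adj              # -17 disables OOM-killing for this process (RHEL 5/6 interface)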

How much memory is used by TCP/UDP across the system?


Socket memory usage is visible in /proc/net/sockstat.

sockets: used 870
TCP: inuse 21 orphan 0 tw 0 alloc 28 mem 10
UDP: inuse 9 mem 6
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
This shows that the system is currently using 10 pages for TCP sockets and 6 pages for UDP sockets. Note that the min, pressure and max settings in net.ipv4.tcp_mem are expressed in pages too.
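
To convert these page counts to bytes, multiply by the system page size (4096 bytes on x86):

# getconf PAGESIZE
4096
# echo $(( 10 * 4096 ))                      # 10 pages of TCP socket memory = 40960 bytes
40960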

How to find a process using the ipcs shared memory segment


The process ID can be identified using lsof:

# lsof | egrep "shmid"

# ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0052e2c1 4194319    postgres   600        11083776   2

From the above output you get the shmid, 4194319 in this example.

# lsof | egrep "4194319"
postmaste 15289  postgres  DEL       REG              0,9               4194319 /SYSV0052e2c1
postmaste 15293  postgres  DEL       REG              0,9               4194319 /SYSV0052e2c1


And as shown above, 15289 and 15293 are the process IDs.
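
The same lookup can be scripted for every segment. The loop below is a rough sketch that simply greps the lsof output for each shmid reported by ipcs (a shmid that happens to match an unrelated number in the lsof output would also be shown):

# for shmid in $(ipcs -m | awk '$2 ~ /^[0-9]+$/ {print $2}'); do
>     echo "=== shmid $shmid ==="
>     lsof 2>/dev/null | grep -w "$shmid" | awk '{print $1, $2, $3}' | sort -u
> done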

What are Transparent Hugepages


Transparent Huge Pages (THP) are enabled by default in RHEL 6 for all applications. The kernel attempts to allocate hugepages whenever possible and any Linux process will receive 2MB pages if the mmap region is 2MB naturally aligned. The main kernel address space itself is mapped with hugepages, reducing TLB pressure from kernel code. For general information on Hugepages, see: What are Huge Pages ?

The kernel will always attempt to satisfy a memory allocation using hugepages. If no hugepages are available (due to non-availability of physically contiguous memory, for example) the kernel will fall back to regular 4KB pages. THP is also swappable (unlike hugetlbfs). This is achieved by breaking the huge page into smaller 4KB pages, which are then swapped out normally.

But to use hugepages effectively, the kernel must find physically contiguous areas of memory big enough to satisfy the request, and also properly aligned. For this, a khugepaged kernel thread has been added. This thread occasionally attempts to replace the smaller pages currently in use with a hugepage allocation, thus maximizing THP usage.

Also, THP is only enabled for anonymous memory regions. There are plans to add support for tmpfs and page cache. THP tunables are found in the /sys tree under /sys/kernel/mm/redhat_transparent_hugepage.

The values for /sys/kernel/mm/redhat_transparent_hugepage/enabled can be one of the following:

always   - always use THP
madvise  - use THP only in memory regions marked with madvise(MADV_HUGEPAGE)
never    - disable THP

khugepaged will be automatically started when transparent_hugepage/enabled is set to "always" or "madvise", and it will be automatically shut down if it is set to "never". The redhat_transparent_hugepage/defrag parameter takes the same values and controls whether the kernel should make aggressive use of memory compaction to make more hugepages available.
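
The currently active mode is shown in brackets when the file is read:

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never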

To Check system-wide THP usage

Run the following command to check system-wide THP usage:

# grep AnonHugePages /proc/meminfo 
AnonHugePages:    632832 kB
Note: Red Hat Enterprise Linux 6.2 or later publishes additional THP monitoring via /proc/vmstat

# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 2018
thp_fault_alloc 7302
thp_fault_fallback 0
thp_collapse_alloc 401
thp_collapse_alloc_failed 0
thp_split 21
To Check THP usage per process

# grep -e AnonHugePages  /proc/*/smaps | awk  '{ if($2>4) print $0} ' |  awk -F "/"  '{print $0; system("ps -fp " $3)} '
/proc/7519/smaps:AnonHugePages:    305152 kB
UID        PID  PPID  C STIME TTY          TIME CMD
qemu      7519     1  1 08:53 ?        00:00:48 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name rhel7 -S -machine pc-i440fx-1.6,accel=kvm,usb=of
/proc/7610/smaps:AnonHugePages:    491520 kB
UID        PID  PPID  C STIME TTY          TIME CMD
qemu      7610     1  2 08:53 ?        00:01:30 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name util6vm -S -machine pc-i440fx-1.6,accel=kvm,usb=
/proc/7788/smaps:AnonHugePages:    389120 kB
UID        PID  PPID  C STIME TTY          TIME CMD
qemu      7788     1  1 08:54 ?        00:00:55 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name rhel64eus -S -machine pc-i440fx-1.6,accel=kvm,us
To disable THP

To disable THP at boot time, append the following to the kernel command line in grub.conf:
transparent_hugepage=never

To disable THP at run time, run the following commands:

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

NOTE: Running the above commands only stops the creation and use of new THP. THP that were already created and in use at the moment the commands were run are not disassembled into regular memory pages. To get rid of THP completely, the system should be rebooted with THP disabled at boot time.
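
This can be confirmed after running the commands above: the tunable reports "never" as active, while transparent hugepages that were already in use may still appear in /proc/meminfo (values are illustrative):

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]
# grep AnonHugePages /proc/meminfo
AnonHugePages:    632832 kB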

How to tell if Explicit HugePages is enabled or disabled

There can be two types of HugePages on the system: Explicit Huge Pages, which are allocated explicitly via the vm.nr_hugepages sysctl parameter, and Transparent Huge Pages, which are allocated automatically by the kernel. See below for how to tell whether Explicit HugePages is enabled or disabled.

Explicit HugePages DISABLED

If the value of HugePages_Total is "0" it means HugePages is disabled on the system.

# grep -i HugePages_Total /proc/meminfo 
HugePages_Total:       0
Similarly, if the value in /proc/sys/vm/nr_hugepages file or vm.nr_hugepages sysctl parameter is "0" it means HugePages is disabled on the system:

# cat /proc/sys/vm/nr_hugepages 
0
# sysctl vm.nr_hugepages
vm.nr_hugepages = 0
Explicit HugePages ENABLED

If the value of HugePages_Total is greater than "0", it means HugePages is enabled on the system

# grep -i HugePages_Total /proc/meminfo 
HugePages_Total:       1024
Similarly if the value in /proc/sys/vm/nr_hugepages file or vm.nr_hugepages sysctl parameter is greater than "0", it means HugePages is enabled on the system:

# cat /proc/sys/vm/nr_hugepages 
1024
# sysctl vm.nr_hugepages
vm.nr_hugepages = 1024


What are Huge Pages

  • Hugepages is a feature that allows the Linux kernel to utilize the multiple page size capabilities of modern hardware architectures. Linux creates multiple pages of virtual memory, mapped from both physical RAM and swap. A page is the basic unit of virtual memory, with the default page size being 4096 Bytes in the x86 architecture.

  • Linux uses a mechanism in the CPU architecture called "Translation Lookaside Buffers" (TLB) to manage the mapping of virtual memory pages to actual physical memory addresses. The TLB is a limited hardware resource, so utilizing a huge amount of physical memory with the default page size consumes the TLB and adds processing overhead - many pages of size 4096 Bytes equates to many TLB resources consumed. By utilizing Huge Pages, we are able to create pages of much larger sizes, each page consuming a single resource in the TLB. A side effect of creating Huge Pages is that the physical memory that is mapped to a Huge Page is no longer subject to normal memory allocations or managed by the kernel virtual memory manager, so Huge Pages are essentially 'protected' and are available only to applications that request them. Huge Pages are 'pinned' to physical RAM and cannot be swapped/paged out.

  • A typical purpose for allocating Huge Pages is for an application with characteristically high memory use, where you wish to ensure that the pages it uses are never swapped out when the system is under memory pressure. Another purpose is to manage memory usage on a 32-bit system: creating Huge Pages and configuring applications to use them reduces the kernel's memory management overhead, since it manages fewer pages. The kernel virtual memory manager utilizes low memory, and fewer pages to manage means it consumes less low memory.

  • In the Linux 2.6 series of kernels, hugepages is enabled using the CONFIG_HUGETLB_PAGE feature when the kernel is built. All kernels supplied by Red Hat for the Red Hat Enterprise Linux 4 release and later releases have the feature enabled.

  • Systems with a large amount of memory can be configured to utilize the memory more efficiently by setting aside a portion dedicated to hugepages. The actual size of the page depends on the system architecture. A typical x86 system will have a Huge Page Size of 2048 kB. The huge page size may be found by looking at /proc/meminfo:

# cat /proc/meminfo | grep Hugepagesize
Hugepagesize: 2048 kB
  • In RHEL 6 or later, the huge page size can also be displayed using the following command (the value is in bytes):

# hugeadm --page-sizes-all
2097152

Note: Enabling hugepages requires the kernel to find contiguous, aligned unallocated regions of memory. For most systems, this means that a reboot will be required to allocate the hugepages.
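
For example, with a 2048 kB huge page size, reserving 1 GB of memory for HugePages requires 512 pages, which can be set through the vm.nr_hugepages parameter discussed above (the reservation may only fully succeed after a reboot if contiguous memory is unavailable):

# echo $(( 1024 * 1024 / 2048 ))             # kB in 1 GB divided by the 2048 kB huge page size
512
# sysctl -w vm.nr_hugepages=512
vm.nr_hugepages = 512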

How to limit memory usage per user in Linux


It is possible to limit the amount of memory that each process can allocate with the RLIMIT facility in Red Hat Enterprise Linux. RLIMIT_AS limits the size of the virtual address space, and it is a suitable solution in many cases that require limiting per-process memory consumption, even though there are many other memory-related limits.


# vi /etc/security/limits.conf
.... 
daniel hard as 1048576  
.... 

$ ulimit -v          # run as the daniel user after logging in again
1048576

In the example above, processes running as the 'daniel' user can use at most 1 GB of virtual address space, so no application run by daniel can allocate more than 1 GB of memory space.

Note: this limitation does not directly restrict the amount of physical memory used, and the values in limits.conf are in KiB (1024-byte units).
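
On RHEL 6 and later, whether the limit is in effect for a particular running process can be verified through /proc (PID 2468 and its output are just an example; 1048576 KiB corresponds to 1073741824 bytes):

# grep "Max address space" /proc/2468/limits
Max address space         1073741824           1073741824           bytes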

hugepages support in a XEN virtual guest


  • hugepages cannot be used in a paravirtualized XEN guest. RHEL 5 does not have this feature enabled. If vm.nr_hugepages is set in RHEL 6, the pages can be allocated, but as soon as they are used the paravirtualized guest will crash.

  • hugepages can be used in a fully virtualized (HVM) XEN guest as well as in a KVM environment.
On a fully virtualized guest, dmidecode will report the BIOS vendor as "Xen" and the System product name as "HVM domU".

[root@test2 ~]# dmidecode
# dmidecode 2.12
SMBIOS 2.4 present.
10 structures occupying 347 bytes.
Table at 0x000E901F.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
    Vendor: Xen
    Version: 3.1.2-371.4.1.el5
    Release Date: 05/07/2013
    Address: 0xE8000
    Runtime Size: 96 kB
    ROM Size: 64 kB
    Characteristics:
        Targeted content distribution is supported
    BIOS Revision: 3.1

Handle 0x0100, DMI type 1, 27 bytes
System Information
    Manufacturer: Red Hat
    Product Name: HVM domU
    Version: 3.1.2-371.4.1.el5
    Serial Number: 4c63d1ea-8876-89a3-d0e7-342fe676a41f
    UUID: 4C63D1EA-8876-89A3-D0E7-342FE676A41F
    Wake-up Type: Power Switch
    SKU Number: Not Specified
    Family: Red Hat Enterprise Linux


On a paravirtualized guest, the dmidecode output will be empty:

[root@test ~]# dmidecode
# dmidecode 2.11
# No SMBIOS nor DMI entry point found, sorry.
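
A quicker check on the fully virtualized guest is to query the individual DMI strings directly:

[root@test2 ~]# dmidecode -s bios-vendor
Xen
[root@test2 ~]# dmidecode -s system-product-name
HVM domU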


How to avoid paging of memory for a process


To avoid paging, a program or process should lock its memory using the mlock system calls during startup.


The mlock and mlockall system calls tell the system to lock a specified memory range and not allow that memory to be paged out. This means that once the physical page has been allocated to the page table entry, references to that page will not fault again.

There are two groups of mlock system calls available. The mlock and munlock calls lock and unlock a specific range of addresses. The mlockall and munlockall calls lock or unlock the entire program space.

Use of the mlock calls should be examined carefully and used with caution. If the application is large, or if it has a large data domain, the mlock calls can cause thrashing if the system cannot allocate memory for other tasks. If the application is entering a time sensitive region of code, an mlockall call prior to entering, followed by munlockall can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static or that will grow slowly but needs to be accessed without page faulting.

Use of mlock will not guarantee that the program will experience no page faults. It is used to ensure that the data will stay in memory, but can not ensure that it will stay in the same page. Other functions such as move_pages and memory compactors can move data around despite the mlock.

Important Note: Always use mlock with care. Using it excessively can lead to an out-of-memory (OOM) condition. Do not simply put an mlockall call at the start of your application; it is recommended that only the data and text of the real-time portion of the application be locked.

mlock is controlled by the ulimit parameter "max locked memory", which can be set with the following command:

# ulimit -l  1024

We can monitor the usage of locked memory in /proc/meminfo 
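
On RHEL 6 and later, the Mlocked field shows the total amount of locked memory (the value below is illustrative):

# grep Mlocked /proc/meminfo
Mlocked:            1024 kB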


How cronjobs are handled when the time on a server changes

Cron jobs are handled according to the amount of time "missed" by cron. By checking a timer specific to cron against the currently reported system time, cron breaks the "missed" time into one of four scenarios:

  • If the difference in time is greater than 1 minute, but less than 5 minutes, crond increments its internal timer by one minute every 10 seconds until its time is caught up with the system's time, running all scheduled jobs for each of these "virtual" 10 second minutes.

  • If the difference in time is greater than 5 minutes, but less than 3 hours, crond first runs any "wildcard" jobs (scheduled for every X minutes or hours) for the current system time, and then begins running missed "fixed-time" jobs as above, starting those jobs for cron's current "virtual" minute, sleeping for 10 seconds, and then starting jobs for the next "virtual" minute.

  • If the difference in time is greater than +/-3 hours, crond simply adjusts its internal time to the current system time and runs any current jobs. Jobs that would have fallen in the intervening time are skipped.

  • If the difference in time is a negative time jump, i.e. difference in time is between 0 and -3 hours, crond runs any "wildcard" jobs (scheduled for every X minutes or hours) for the current system time and ignores any "fixed-time" or missed "wildcard" jobs. crond does not update its internal time in this scenario, instead waiting for the system's time to catch up.