Linux Process and Thread Limits – 徐霁's Blog

Linux Processes and Threads

I will skip the basic concepts; here is how Richard Stevens describes the difference:

fork is expensive. Memory is copied from the parent to the child, all descriptors are duplicated in the child, and so on. Current implementations use a technique called copy-on-write, which avoids a copy of the parent’s data space to the child until the child needs its own copy. But, regardless of this optimization, fork is expensive. IPC is required to pass information between the parent and child after the fork. Passing information from the parent to the child before the fork is easy, since the child starts with a copy of the parent’s data space and with a copy of all the parent’s descriptors. But, returning information from the child to the parent takes more work. Threads help with both problems. Threads are sometimes called lightweight processes since a thread is “lighter weight” than a process. That is, thread creation can be 10–100 times faster than process creation. All threads within a process share the same global memory. This makes the sharing of information easy between the threads, but along with this simplicity comes the problem of synchronization.

On Linux, processes are created with fork and threads with clone. ps -ef shows the process list, while ps -eLf also shows threads. In top, the H switch toggles the thread view.
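The thread view of ps -eLf can also be reproduced by looking at /proc directly, since each thread of a process appears as a subdirectory of /proc/PID/task. The following is a minimal sketch; the helper name count_threads is made up for illustration, and it returns None where /proc is not available.

```python
import os


def count_threads(pid="self"):
    """Return the number of threads of a process, or None if it cannot be read."""
    task_dir = f"/proc/{pid}/task"
    try:
        # Each entry in /proc/<pid>/task is one thread, named after its thread ID.
        return len(os.listdir(task_dir))
    except OSError:
        return None


if __name__ == "__main__":
    print("threads in this process:", count_threads())
```

Passing a numeric PID instead of "self" inspects another process, for example a running JVM.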

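As for the Java threading model, the specification does not define how Java threads map to system threads, but the common implementations today map them one-to-one. See http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html#Thread%20Management|outline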

Troubleshooting approach

When a Java thread cannot be created, the error is:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread

The common causes are listed below.

Too little memory

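Every Java thread needs its own stack space. The default stack size is 1 MB (it can be tuned per application with the -Xss option). If the stack is too small or recursion goes too deep, a StackOverflowError may be thrown.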

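For a process with a fixed amount of usable memory, the more is given to the heap, the less is left for thread stacks. This limit is mostly seen in 32-bit Java applications: the process address space is 4 GB, of which user space gets 2 GB (3 GB on Linux, so the heap can usually be set somewhat larger there). Subtract the heap (bounded by -Xms and -Xmx), the non-heap areas (the permanent generation is sized with PermSize and MaxPermSize; Java 8 replaced it with Metaspace, which is unlimited by default), and the JVM's own consumption, and what remains is available for stacks. If 300 MB remain, then with 1 MB stacks the theoretical limit is about 300 threads. For 64-bit applications the address space is practically unlimited, so this problem can usually be ignored.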
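The subtraction above can be written down as a small calculation. This is only a rough sketch of the arithmetic: the function name and the example sizes (2 GB heap, 256 MB non-heap, 468 MB JVM overhead) are hypothetical numbers chosen so that 300 MB remain, as in the example in the text.

```python
def max_threads_by_address_space(user_space_mb, heap_mb, non_heap_mb,
                                 jvm_overhead_mb, stack_mb=1):
    """Estimate how many thread stacks fit into the leftover address space."""
    leftover_mb = user_space_mb - heap_mb - non_heap_mb - jvm_overhead_mb
    if leftover_mb <= 0:
        return 0
    return int(leftover_mb // stack_mb)


# Hypothetical 32-bit JVM on Linux: 3 GB user space, 300 MB left for stacks.
print(max_threads_by_address_space(3072, 2048, 256, 468))                # 300
# Halving the stack size (for example -Xss512k) doubles the estimate.
print(max_threads_by_address_space(3072, 2048, 256, 468, stack_mb=0.5))  # 600
```

The second call shows why lowering -Xss is a common workaround on 32-bit systems.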

ulimit limits

The number of threads is also subject to system limits, which can be viewed with ulimit -a.

https://ss64.com/bash/ulimit.html

caixj@Lenovo-PC:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7823
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7823
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
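Of the values in this listing, the ones most relevant to thread creation are probably max user processes (-u, which on Linux counts threads as well), stack size (-s) and virtual memory (-v). The sketch below reads the same limits from inside a process with Python's resource module; the selection of limits and the helper names are my own choice, not something the ulimit page prescribes.

```python
import resource

# ulimit label -> name of the matching constant in the resource module.
LIMITS = {
    "max user processes (-u)": "RLIMIT_NPROC",
    "stack size (-s)": "RLIMIT_STACK",
    "virtual memory (-v)": "RLIMIT_AS",
}


def _show(value):
    return "unlimited" if value == resource.RLIM_INFINITY else value


def thread_related_limits():
    """Return {label: (soft, hard)} for limits that affect thread creation."""
    found = {}
    for label, const_name in LIMITS.items():
        const = getattr(resource, const_name, None)  # not defined on every platform
        if const is None:
            continue
        soft, hard = resource.getrlimit(const)
        found[label] = (_show(soft), _show(hard))
    return found


for label, (soft, hard) in thread_related_limits().items():
    # Sizes are reported in bytes here, while ulimit -a prints kbytes.
    print(f"{label}: soft={soft} hard={hard}")
```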

The kernel.threads-max parameter

https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

This value controls the maximum number of threads that can be created using fork().

During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages.

The minimum value that can be written to threads-max is 20. The maximum value that can be written to threads-max is given by the constant FUTEX_TID_MASK (0x3fffffff). If a value outside of this range is written to threads-max an error EINVAL occurs.

The value written is checked against the available RAM pages. If the thread structures would occupy too much (more than 1/8th) of the available RAM pages threads-max is reduced accordingly.
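To get a feeling for the 1/8th rule, here is a rough back-of-the-envelope calculation. It assumes that the thread structures cost 16 KB per thread, which is an assumption for illustration only; the real per-thread size depends on the architecture and kernel version, so the result is an order-of-magnitude estimate rather than the value the kernel will actually pick.

```python
def estimate_default_threads_max(ram_bytes, thread_size_bytes=16 * 1024):
    """Threads whose thread structures would fill 1/8 of RAM (assumed size per thread)."""
    min_threads = 20          # minimum value accepted for threads-max
    max_threads = 0x3FFFFFFF  # FUTEX_TID_MASK, the maximum accepted value
    threads = ram_bytes // (8 * thread_size_bytes)
    return max(min_threads, min(threads, max_threads))


# 8 GiB of RAM with the assumed 16 KB per thread.
print(estimate_default_threads_max(8 * 1024**3))  # 65536
```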

threads-max is the system-wide limit on the total number of threads. It can be set in two ways:

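# Option 1: set at runtime, takes effect temporarily (lost on reboot)
echo 999999 > /proc/sys/kernel/threads-max
# Option 2: edit /etc/sysctl.conf, takes effect permanently (apply with sysctl -p)
kernel.threads-max = 999999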
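Before changing the value it helps to check what is currently in effect. The sketch below reads a sysctl through /proc/sys by turning the dotted name into a path; read_sysctl is a hypothetical helper, and the proc_sys parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_sysctl(name, proc_sys="/proc/sys"):
    """Read an integer sysctl such as 'kernel.threads-max'; None if unreadable."""
    # 'kernel.threads-max' corresponds to /proc/sys/kernel/threads-max.
    path = Path(proc_sys, *name.split("."))
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


print("kernel.threads-max =", read_sysctl("kernel.threads-max"))
```

The same helper works for the other parameters discussed in this article, for example read_sysctl("kernel.pid_max") or read_sysctl("vm.max_map_count").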

The kernel.pid_max parameter

https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

PID allocation wrap value. When the kernel's next PID value reaches this value, it wraps back to a minimum PID value. PIDs of value pid_max or larger are not allocated.

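pid_max is the system-wide upper bound on PID values. Every thread also takes an ID from this space, so it caps the total number of threads as well. It can be set in two ways: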

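# Option 1: set at runtime, takes effect temporarily (lost on reboot)
echo 999999 > /proc/sys/kernel/pid_max
# Option 2: edit /etc/sysctl.conf, takes effect permanently (apply with sysctl -p)
kernel.pid_max = 999999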
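Since threads-max and pid_max both apply system-wide, the tighter of the two is a rough upper bound for the whole machine. The sketch below counts all tasks currently on the system and compares them with that bound. The function names and the example values are made up for illustration, and the bound is slightly optimistic because some low PIDs are never handed out to ordinary tasks.

```python
import os


def count_all_tasks(proc="/proc"):
    """Count every thread on the system by walking /proc/<pid>/task."""
    total = 0
    try:
        entries = os.listdir(proc)
    except OSError:
        return 0
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            total += len(os.listdir(os.path.join(proc, entry, "task")))
        except OSError:
            pass  # the process exited or is not readable
    return total


def global_task_cap(threads_max, pid_max):
    """The tighter of the two system-wide limits."""
    return min(threads_max, pid_max)


if __name__ == "__main__":
    # Hypothetical values; read the real ones from /proc/sys/kernel/.
    cap = global_task_cap(threads_max=62000, pid_max=32768)
    print(f"{count_all_tasks()} tasks in use, rough system-wide cap {cap}")
```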

The vm.max_map_count parameter

https://www.kernel.org/doc/Documentation/sysctl/vm.txt

This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap, mprotect, and madvise, and also when loading shared libraries.

While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation.

The default value is 65536.

max_map_count limits the number of memory map areas a single process may use. It can be set in two ways:

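# Option 1: set at runtime, takes effect temporarily (lost on reboot)
echo 999999 > /proc/sys/vm/max_map_count
# Option 2: edit /etc/sysctl.conf, takes effect permanently (apply with sysctl -p)
vm.max_map_count = 999999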

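When other resources are sufficient, the maximum number of threads a single JVM can start is about half of this value. The number of mappings currently in use can be checked with cat /proc/PID/maps | wc -l.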
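The "half of max_map_count" rule can be turned into a quick estimate of how many more threads a process could still start. This is a sketch under the assumption stated in the text that each thread costs two mappings; the function names and the example numbers are illustrative only.

```python
def count_mappings(pid="self"):
    """Equivalent of `cat /proc/PID/maps | wc -l`; None if the file is unreadable."""
    try:
        with open(f"/proc/{pid}/maps") as maps:
            return sum(1 for _ in maps)
    except OSError:
        return None


def remaining_thread_estimate(max_map_count, mappings_in_use, maps_per_thread=2):
    """How many more threads fit if each one needs maps_per_thread mappings."""
    free = max(0, max_map_count - mappings_in_use)
    return free // maps_per_thread


print("mappings used by this process:", count_mappings())
# Hypothetical numbers: a limit of 65530 with 1530 mappings already in use.
print(remaining_thread_estimate(65530, 1530))  # 32000
```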

As for why it is only half, here is an analysis based on some references and the source code:

The warnings usually seen look like this; see JavaThread::create_stack_guard_pages():

Attempt to protect stack guard pages failed.
Attempt to deallocate stack guard pages failed.

See the diagram in current_stack_region(), together with the related explanation by R大: http://hllvm.group.iteye.com/group/topic/37717

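As the diagram below shows, a typical Java thread has a glibc guard page and HotSpot guard pages. JavaThread::create_stack_guard_pages() is the function that creates the HotSpot guard pages. Each thread therefore normally takes 2 VMAs, which is why the maximum is only half of max_map_count. /proc/PID/maps confirms this: each additional thread adds 2 address-adjacent mappings.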

// Java thread:
//
//   Low memory addresses
//    +------------------------+
//    |                        |\  JavaThread created by VM does not have glibc
//    |    glibc guard page    | - guard, attached Java thread usually has
//    |                        |/  1 page glibc guard.
// P1 +------------------------+ Thread::stack_base() - Thread::stack_size()
//    |                        |\
//    |  HotSpot Guard Pages   | - red and yellow pages
//    |                        |/
//    +------------------------+ JavaThread::stack_yellow_zone_base()
//    |                        |\
//    |      Normal Stack      | -
//    |                        |/
// P2 +------------------------+ Thread::stack_base()
//
// Non-Java thread:
//
//   Low memory addresses
//    +------------------------+
//    |                        |\
//    |  glibc guard page      | - usually 1 page
//    |                        |/
// P1 +------------------------+ Thread::stack_base() - Thread::stack_size()
//    |                        |\
//    |      Normal Stack      | -
//    |                        |/
// P2 +------------------------+ Thread::stack_base()
//
// ** P1 (aka bottom) and size ( P2 = P1 - size) are the address and stack size returned from
//    pthread_attr_getstack()
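The two adjacent mappings per thread can be spotted in /proc/PID/maps as an inaccessible region (permissions ---p, the guard pages) immediately followed by a read-write region (the usable stack). The sketch below looks for such pairs. The SAMPLE text is hypothetical data made up for illustration (12 KB of guard pages plus 1012 KB of stack, adding up to a 1 MB thread stack), not output captured from a real JVM, and the heuristic may also match non-stack regions in a real process.

```python
def parse_maps(text):
    """Parse /proc/PID/maps text into (start, end, perms) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        regions.append((start, end, fields[1]))
    return regions


def guard_stack_pairs(regions):
    """Find an inaccessible region directly followed by a read-write one."""
    pairs = []
    for (s1, e1, p1), (s2, e2, p2) in zip(regions, regions[1:]):
        if p1.startswith("---") and p2.startswith("rw") and e1 == s2:
            pairs.append(((s1, e1), (s2, e2)))
    return pairs


# Hypothetical excerpt: one unrelated region, then two thread stacks.
SAMPLE = """\
7f10a8000000-7f10a8021000 rw-p 00000000 00:00 0
7f10ac0fd000-7f10ac100000 ---p 00000000 00:00 0
7f10ac100000-7f10ac1fd000 rw-p 00000000 00:00 0
7f10ac1fd000-7f10ac200000 ---p 00000000 00:00 0
7f10ac200000-7f10ac2fd000 rw-p 00000000 00:00 0
"""

for (g_start, g_end), (s_start, s_end) in guard_stack_pairs(parse_maps(SAMPLE)):
    print(f"guard {(g_end - g_start) // 1024} KB + stack {(s_end - s_start) // 1024} KB")
```

Running the same functions on the real /proc/PID/maps of a JVM before and after starting threads is one way to check the two-mappings-per-thread observation.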

cgroup limits

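Newer operating systems use systemd as the init program, which supports cgroup-based resource control. This is also the underlying technology behind Docker's resource isolation.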

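One important limit here is the maximum number of tasks, TasksMax, which is enforced by setting the cgroup's pids.max. See, for example, the SUSE SLES 12 SP2 release notes: https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12-SP2/#fate-320358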

If you notice regressions, you can change a number of TasksMax settings.

To control the default TasksMax= setting for services and scopes running on the system, use the system.conf setting DefaultTasksMax=. This setting defaults to 512, which means services that are not explicitly configured otherwise will only be able to create 512 processes or threads at maximum.

For thread- or process-heavy services, you may need to set a higher TasksMax value. In such cases, set TasksMax directly in the specific unit files. Either choose a numeric value or even infinity.

Similarly, you can limit the total number of processes or tasks each user can own concurrently. To do so, use the logind.conf setting UserTasksMax (the default is 12288).

nspawn containers now also have a TasksMax value set, with a default of 16384.

The description above tells us:

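For login sessions there is a default limit, UserTasksMax, configured in /etc/systemd/logind.conf. It caps the total number of tasks of a user, 12288 in the notes above. After editing this file, reload it with systemctl restart systemd-logind.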

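For services, the DefaultTasksMax parameter in /etc/systemd/system.conf applies. Its default is 512 (this very likely differs between distributions). A service that needs a different value has to be configured individually.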
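To check which defaults a machine actually uses, the two configuration files can be scanned for these keys. The following is a minimal sketch that only understands plain key=value lines and the two value forms mentioned in the release notes (a number or infinity); the helper names and the SAMPLE_SYSTEM_CONF text are made up for illustration, and drop-in directories are not considered.

```python
def read_setting(conf_text, key):
    """Return the last uncommented 'key=value' in systemd-style config text."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        name, sep, raw = line.partition("=")
        if sep and name.strip() == key:
            value = raw.strip()
    return value


def parse_tasks_max(value):
    """Turn a TasksMax-style value into an int, or None for 'infinity'."""
    if value == "infinity":
        return None
    return int(value)  # raises ValueError for forms this sketch does not handle


# Hypothetical content of /etc/systemd/system.conf.
SAMPLE_SYSTEM_CONF = """\
[Manager]
#DefaultTasksMax=512
DefaultTasksMax=infinity
"""

setting = read_setting(SAMPLE_SYSTEM_CONF, "DefaultTasksMax")
print(setting, "->", parse_tasks_max(setting))
```

If the key is only present as a comment, read_setting returns None, which means the built-in default of the distribution applies.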

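The settings above are the global cgroup defaults; limits can also be refined down to a single process. For the details of this feature, see the article "Linux Cgroup系列（03）：限制cgroup的进程数（subsystem之pids）".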

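Running find /sys/fs/cgroup -name "pids.max" shows the fine-grained settings. For example, ./pids/user.slice/user-1000.slice/pids.max is the limit for the user with id 1000; it effectively overrides the logind.conf default mentioned above, and changes to this value take effect immediately.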
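The find command can be mirrored in a few lines that also print the value stored in each file, which is either a number or the word max. This is a sketch; find_pids_max is a hypothetical helper, and unreadable files are silently skipped.

```python
import os


def find_pids_max(root="/sys/fs/cgroup"):
    """Yield (path, value) for every pids.max file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "pids.max" not in filenames:
            continue
        path = os.path.join(dirpath, "pids.max")
        try:
            with open(path) as handle:
                yield path, handle.read().strip()
        except OSError:
            continue  # no permission, or the cgroup vanished meanwhile


if __name__ == "__main__":
    for path, value in find_pids_max():
        print(f"{value:>10}  {path}")
```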

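To see the limit that applies to a specific process, check its runtime state in /proc/PID/cgroup: the pids entry there names the cgroup the process belongs to, and the pids.max file in that cgroup holds the limit. For a more detailed walkthrough, see this case study: https://zhuanlan.zhihu.com/p/29192624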
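The lookup from /proc/PID/cgroup to the matching pids.max can be sketched as follows. Each line of that file has the form hierarchy-ID:controllers:path. The code assumes the pids controller is mounted at /sys/fs/cgroup/pids (as in the path shown earlier) and falls back to the unified cgroup v2 hierarchy, where the line looks like 0::/path. The function name and the sample lines are hypothetical.

```python
def pids_cgroup_dir(cgroup_text, mount="/sys/fs/cgroup"):
    """Map the content of /proc/PID/cgroup to the directory holding pids.max."""
    unified = None
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy, controllers, path = parts
        if "pids" in controllers.split(","):
            # cgroup v1: the pids controller has its own hierarchy (assumed mount point).
            return f"{mount}/pids{path}"
        if controllers == "":
            unified = f"{mount}{path}"  # cgroup v2 unified hierarchy
    return unified


# Hypothetical /proc/PID/cgroup content on a cgroup v1 system.
SAMPLE_V1 = """\
4:pids:/user.slice/user-1000.slice/session-1.scope
1:name=systemd:/user.slice/user-1000.slice/session-1.scope
"""

print(pids_cgroup_dir(SAMPLE_V1) + "/pids.max")
```

Reading the file at the printed path, and pids.current next to it, shows how close the process group is to its task limit.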

Reposted from: https://mccxj.github.io/blog/20171230_os-thread-limit.html