Understanding Processes in Linux

Understanding Processes

Processes, Light-weight Processes, Threads and Tasks

Let us understand the concepts of processes, threads and tasks in Linux.

A process is an instance of a program in execution. A process is composed of one or more user threads (or simply threads), each of which represents an execution flow of the process. Nowadays, most multithreaded applications are written using a standard set of library functions called the pthread (POSIX thread) library. When such a library is implemented purely in user space, threads are merely an abstraction for the programmer. That alone does not provide the true benefits of multithreading: all the threads share a single execution context, and because the kernel is oblivious to them, the whole process blocks whenever one thread blocks in a system call such as read().

Linux therefore provides lightweight processes to offer better support for multithreaded applications. Basically, two lightweight processes may share some resources, like the address space, the open files, and so on. Whenever one of them modifies a shared resource, the other immediately sees the change. However, each lightweight process has an independent execution context and is treated as an independent process by the kernel. For a deeper understanding, let us look at the way a new process is created in Linux. One can use either the fork() or the clone() system call to create a new process. A fork() always creates a completely independent process which does not share the address space of the parent process (though a forked process does start out pointing at the same memory pages, and a copy-on-write model is used to optimize space utilization). A clone(), on the other hand, allows granular control over process creation: one can specify whether the child process should share the address space, open files, signal handlers, etc. with the parent. A process created using clone() which shares these attributes with its parent is known as a lightweight process. In effect, therefore, everything in Linux is a process which either shares the resources of its parent or does not. In fact, fork() is implemented as a wrapper over clone() with the flags set so that nothing is shared between the parent and child processes.
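
The isolation that fork() gives a child can be observed with a minimal sketch. It assumes a POSIX system with Python 3 and uses the standard os.fork() wrapper; the function name fork_keeps_memory_private is made up for illustration. The child changes a variable and exits, and since it works on its own copy-on-write view of memory, the parent should not observe the change.

```python
import os


def fork_keeps_memory_private():
    """Fork a child that modifies a variable; return the parent's view of it."""
    counter = [0]
    pid = os.fork()          # the child starts with a copy-on-write view of memory
    if pid == 0:
        counter[0] = 42      # touches only the child's private copy
        os._exit(0)          # leave the child immediately, skipping cleanup
    os.waitpid(pid, 0)       # reap the child so it does not linger as a zombie
    return counter[0]


if __name__ == "__main__":
    print("parent still sees:", fork_keeps_memory_private())
```

The write in the child triggers a private copy of the affected page, so the parent's value stays at 0.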

A straightforward way to implement multithreaded applications is to associate a lightweight process with each thread created in the pthreads library. In this way, the threads can access the same set of application data structures by simply sharing the same memory address space, the same set of open files, and so on; at the same time, each thread can be scheduled independently by the kernel so that one may sleep while another remains runnable. This implementation has an advantage: the kernel sees everything as a process, and the scheduler does not distinguish between threads and processes. The terms LWP and thread are used interchangeably, since an application runtime that supports threads typically creates an LWP underneath each thread, though it can be otherwise. In this document, therefore, thread, task and LWP mean the same thing.
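
As a rough illustration of threads sharing one address space, the sketch below starts a few threads that all append to the same list. It assumes Python 3 on Linux, where CPython's threading module creates pthreads; the function name threads_share_memory is hypothetical.

```python
import threading


def threads_share_memory(n_threads=4):
    """Start n_threads workers that all write into one shared list."""
    shared = []                      # lives in the single shared address space
    lock = threading.Lock()

    def worker(index):
        with lock:                   # serialize access to the shared list
            shared.append(index)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # wait until every worker has finished
    return sorted(shared)


if __name__ == "__main__":
    print(threads_share_memory())
```

Every worker's write is visible to the caller because all threads operate on the same memory, with no copying between them.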

Each process (including an LWP) has its own PID. However, the POSIX standard states that all threads of a multithreaded application must have the same PID. Linux overcomes this by making use of thread groups. Each thread belongs to a group, and the PID of the first thread created in the group (the group leader) is stored in a field called tgid (thread group ID). The getpid() call returns this field rather than the thread's actual PID. For a process with a single thread, the tgid and the PID are the same. See the examples below:

[user@server ~]$ cat /proc/10200/status
Name: postgres
State: S (sleeping)
SleepAVG: 98%
Tgid: 10200
Pid: 10200
PPid: 14860

[user@server ~]$ cat /proc/14860/status
Name: postgres
State: S (sleeping)
SleepAVG: 98%
Tgid: 14860
Pid: 14860
PPid: 1

The gettid() call in Linux returns the actual PID of an LWP. This differs from the value returned by getpid() only when the LWP is part of a thread group and is not the group leader. One can find the actual PIDs and statuses of the threads within a process using the /proc filesystem as follows:

[user@server ~]$ cat /proc/23638/status
Name: mysqld
State: S (sleeping)
SleepAVG: 98%
Tgid: 23638
Pid: 23638
PPid: 2561

[user@server ~]$ cat /proc/23638/task/14514/status
Name: mysqld
State: S (sleeping)
SleepAVG: 98%
Tgid: 23638
Pid: 14514
PPid: 2561

Note that in the above example the first listing represents the mysqld process, while the second represents a thread (LWP) within that mysqld process. Every thread that is part of the mysqld process (pid: 23638) has its own directory under /proc/23638/task/.
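
The Tgid/Pid relationship shown above can also be checked programmatically. The following is a minimal sketch, assuming Python 3.8 or later; the names parse_status, is_group_leader, worker_ids and SAMPLE are invented for illustration, and SAMPLE simply reuses the thread listing from the example above.

```python
import os
import threading

SAMPLE = """Name: mysqld
State: S (sleeping)
Tgid: 23638
Pid: 14514
PPid: 2561"""


def parse_status(text):
    """Turn the 'Key: value' lines of a status file into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_group_leader(fields):
    """A task leads its thread group when its Pid equals its Tgid."""
    return fields["Tgid"] == fields["Pid"]


def worker_ids():
    """Return (getpid() as seen by a new thread, that thread's kernel ID)."""
    seen = []
    t = threading.Thread(
        target=lambda: seen.append((os.getpid(), threading.get_native_id()))
    )
    t.start()
    t.join()
    return seen[0]


if __name__ == "__main__":
    print(is_group_leader(parse_status(SAMPLE)))
    print(worker_ids())
```

For the sample listing, is_group_leader reports False because Pid 14514 differs from Tgid 23638. On Linux, worker_ids should return the process ID together with a different kernel thread ID, which is the value that appears under /proc/<pid>/task/.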

One can use "ps H -Le" to display all threads as if they were processes. Alternatively, one can use top and toggle the display of threads using the interactive "H" key (depending on the top version, this may not show the actual thread ID). In fact, when working with processes, the commands you will generally use are ps, top, htop (a better version of top), and the /proc filesystem.

User-space concurrency model:
An excellent example of a concurrency model implemented in user space is the Scala actor model. Actors in Scala represent a concurrency abstraction for the user. However, an actor does not map to a thread of its own. Instead, actors are executed on a thread pool. Ideally, the size of the thread pool corresponds to the number of processor cores of the machine. The thread pool grows if all the worker threads are blocked but there are still remaining tasks to be processed. Erlang has a similar user-space concurrency model.
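
The idea of running many units of work on a small pool of threads can be sketched as follows. This is only an analogy in Python using the standard concurrent.futures module, not the Scala actor implementation; run_on_pool and its parameters are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_on_pool(n_tasks, pool_size):
    """Run n_tasks small jobs on pool_size threads; report which threads ran them."""
    used = set()
    lock = threading.Lock()

    def task(i):
        with lock:
            used.add(threading.current_thread().name)   # record the worker thread
        return i * i

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(task, range(n_tasks)))  # results keep input order
    return results, used


if __name__ == "__main__":
    results, used = run_on_pool(20, 3)
    print(len(results), "tasks ran on", len(used), "thread(s)")
```

However many tasks are submitted, the number of distinct worker threads never exceeds the pool size, which is the property the actor runtime relies on.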

Process States

A process (which includes a thread) on a Linux machine can be in any of the following states -

  • TASK_RUNNING - The process is either executing on a CPU or waiting to be executed.
  • TASK_INTERRUPTIBLE - The process is suspended (sleeping) until some condition becomes true. Raising a hardware interrupt, releasing a system resource the process is waiting for, or delivering a signal are examples of conditions that might wake up the process (put its state back to TASK_RUNNING). Typically, blocking I/O calls (disk/network) result in the task being marked as TASK_INTERRUPTIBLE. As soon as the data it is waiting on is ready to be read, the device raises an interrupt and the interrupt handler changes the state of the task back to TASK_RUNNING. Processes that are idle (i.e. not performing any task) should also be in this state.
  • TASK_UNINTERRUPTIBLE - Like TASK_INTERRUPTIBLE, except that delivering a signal to the sleeping process leaves its state unchanged. This process state is seldom used. It is valuable, however, under certain specific conditions in which a process must wait until a given event occurs without being interrupted. Ideally not too many tasks will be in this state.
    • For instance, this state may be used when a process opens a device file and the corresponding device driver starts probing for a corresponding hardware device. The device driver must not be interrupted until the probing is complete, or the hardware device could be left in an unpredictable state.
    • Atomic write operations may require a task to be marked as UNINTERRUPTIBLE
    • NFS access sometimes results in access processes being marked as UNINTERRUPTIBLE
    • reads/writes from/to disk can be marked thus for a fraction of a second
    • I/O following a page fault marks a process UNINTERRUPTIBLE
    • I/O to the same disk that is being accessed for page faults can result in a process marked as UNINTERRUPTIBLE
    • Programmers may mark a task as UNINTERRUPTIBLE instead of using INTERRUPTIBLE
  • TASK_STOPPED - Process execution has been stopped; the process enters this state after receiving a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal
  • TASK_TRACED - Process execution has been stopped by a debugger
  • EXIT_ZOMBIE - Process execution is terminated, but the parent process has not yet issued a wait4( ) or waitpid( ) system call. The OS will not clear zombie processes until the parent issues a wait()-like call
  • EXIT_DEAD - The final state: the process is being removed by the system because the parent process has just issued a wait4( ) or waitpid( ) system call for it. Changing its state from EXIT_ZOMBIE to EXIT_DEAD avoids race conditions due to other threads of execution that execute wait( )-like calls on the same process.
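
The EXIT_ZOMBIE state can be observed directly. The sketch below assumes a Linux system with /proc mounted and Python 3; read_state and observe_zombie are hypothetical helper names. It forks a child that exits at once, polls the child's state letter until it shows "Z", and then reaps the child with waitpid().

```python
import os
import time


def read_state(pid):
    """Return the one-letter state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state letter follows the command name, which is wrapped in parentheses.
    return data[data.rindex(")") + 2]


def observe_zombie():
    """Fork a short-lived child; return (state before reaping, gone after reaping)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)                  # the child terminates immediately
    state = read_state(pid)
    for _ in range(200):             # allow up to about 2 s for the exit to show
        if state == "Z":
            break
        time.sleep(0.01)
        state = read_state(pid)
    os.waitpid(pid, 0)               # the parent issues the wait()-like call
    gone = not os.path.exists(f"/proc/{pid}")
    return state, gone


if __name__ == "__main__":
    print(observe_zombie())
```

Until waitpid() is called the child remains visible as a zombie; once it has been reaped, its /proc entry disappears.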

Only processes that are in the TASK_RUNNING state are candidates for using a free CPU. If no task is in the RUNNING state, the CPU remains idle. All tasks in the RUNNING state compete for CPU time (along with kernel tasks). Based on task priority, the kernel scheduler determines which task gets a slice of CPU time and for how long.

Additionally processes are organized into sets of sessions. The session's ID is the same as the pid of the process that created the session. That process is known as the session leader for that session group. All of that process's descendants are then members of that session unless they specifically remove themselves from it.
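
A process removes itself from its session by creating a new one with setsid(). The following minimal sketch, assuming a POSIX system and Python 3, forks a child that calls os.setsid() and reports its new session ID back through a pipe; child_starts_new_session is a made-up name.

```python
import os


def child_starts_new_session():
    """Fork a child that calls setsid(); return (child pid, child's new session ID)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os.setsid()                              # leave the parent's session
        os.write(write_fd, str(os.getsid(0)).encode())
        os._exit(0)
    os.close(write_fd)                           # so that read() sees end of file
    with os.fdopen(read_fd) as pipe:
        child_sid = int(pipe.read())
    os.waitpid(pid, 0)                           # reap the child
    return pid, child_sid


if __name__ == "__main__":
    pid, sid = child_starts_new_session()
    print("child", pid, "is leader of session", sid)
```

The reported session ID equals the child's own PID, which makes the child the leader of the new session.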

Understanding load average

Load average refers to the average number of processes (including threads) that were running or waiting over a certain time period. Conventionally this counts processes in the TASK_RUNNING state, that is, those executing on or waiting for a CPU; on Linux it also counts processes in uninterruptible sleep. The load average is therefore the average number of processes in either the TASK_RUNNING or the TASK_UNINTERRUPTIBLE state over a period of time, computed using an exponential decay formula. A high value signifies processes starved for CPU (or possibly for disk, in case processes doing disk I/O are in the UNINTERRUPTIBLE state). If no processes are marked UNINTERRUPTIBLE, the load average should not be much higher than the number of CPU cores in the machine; a higher load average signifies that processes are waiting for CPU.
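
The exponential decay mentioned above can be sketched in a few lines. This is a floating-point approximation under stated assumptions, not the kernel's code: it assumes the count of active tasks is sampled every 5 seconds and decayed over a 60-second window (the 1-minute average), whereas the kernel uses fixed-point arithmetic. The names next_load and simulate are hypothetical.

```python
import math


def next_load(load, active, interval=5.0, window=60.0):
    """Fold one sample of 'active' tasks into an exponentially decayed average."""
    decay = math.exp(-interval / window)     # weight kept from the old average
    return load * decay + active * (1.0 - decay)


def simulate(active_counts, window=60.0):
    """Start from an idle system and apply one sample per interval."""
    load = 0.0
    for active in active_counts:
        load = next_load(load, active, window=window)
    return load


if __name__ == "__main__":
    print(round(simulate([4] * 12), 2))      # one minute of 4 active tasks
    print(round(simulate([4] * 1000), 2))    # a long steady state
```

Under these assumptions, one minute of 4 active tasks on a previously idle machine yields a 1-minute load average of about 2.53 rather than 4; the value only approaches 4 if the load persists.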

Understanding Process Priorities

Each process has a priority, which is a number from 100 (highest priority) to 139 (lowest priority). The time quantum each process gets from the scheduler depends on its priority. For example, a priority value of 100 gives a process a time quantum of 800 ms, while a value of 139 results in a time quantum of 5 ms. While a process starts out with a static priority, the kernel computes a dynamic priority for each process based on its average sleep time. The average sleep time also determines whether a process is treated as interactive or batch, and the scheduling of the process changes based on this determination.
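
The two time quanta quoted above are consistent with the base time quantum rule of the O(1) scheduler used in early 2.6 kernels. The sketch below assumes that rule (it is not stated in this document, and schedulers in later kernels work differently); base_time_quantum_ms and static_priority_from_nice are hypothetical names.

```python
def static_priority_from_nice(nice):
    """Map a nice value (-20..19) to a static priority (100..139)."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be between -20 and 19")
    return 120 + nice


def base_time_quantum_ms(static_priority):
    """Base time quantum in milliseconds, assuming the O(1) scheduler's rule."""
    if not 100 <= static_priority <= 139:
        raise ValueError("static priority must be between 100 and 139")
    scale = 20 if static_priority < 120 else 5   # high priorities scale faster
    return (140 - static_priority) * scale


if __name__ == "__main__":
    for nice in (-20, 0, 19):
        prio = static_priority_from_nice(nice)
        print(nice, prio, base_time_quantum_ms(prio))
```

With this rule a static priority of 100 gives 800 ms and 139 gives 5 ms, matching the figures in the text, while the default priority of 120 (nice 0) gives 100 ms.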

Monitoring processes

Using top to check process states

The top command shows tasks currently running -

[user@server ~]$ top
top - 03:37:45 up 5 days, 7:57, 12 users, load average: 7.24, 5.68, 5.09
Tasks: 471 total, 15 running, 456 sleeping, 0 stopped, 0 zombie
Cpu(s): 39.2%us, 8.8%sy, 9.1%ni, 38.8%id, 1.5%wa, 0.0%hi, 2.6%si, 0.0%st
Mem: 132093140k total, 131496368k used, 596772k free, 380832k buffers
Swap: 2096472k total, 492k used, 2095980k free, 126816660k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
12733 root     39 19  4724 1316  380 R 72.9  0.0 134:23.01 lzop
 6899 postgres 16  0 3602m 1.8g 1.8g R 22.1  1.4   0:15.55 postgres
 6857 postgres 15  0 3603m 386m 379m R 17.8  0.3   0:07.49 postgres
 6884 postgres 15  0 3602m 139m 134m R 12.9  0.1   0:05.85 postgres
11878 postgres 15  0 3620m 2.0g 2.0g R 12.2  1.6  24:49.97 postgres
 6853 postgres 16  0 3602m 1.8g 1.8g R 11.9  1.4   0:15.17 postgres
14862 postgres 15  0 71596  912  544 R 10.9  0.0 552:45.73 postgres
 7413 postgres 15  0 3600m 127m 122m R 10.2  0.1   0:04.79 postgres
25177 postgres 16  0 3632m 2.1g 2.0g R  9.9  1.6  57:45.83 postgres
 9068 postgres 16  0 3602m 129m 124m R  8.9  0.1   0:05.17 postgres
 9073 postgres 16  0 3600m 138m 133m R  7.3  0.1   0:05.35 postgres
 6854 postgres 15  0 3600m 123m 118m D  6.3  0.1   0:05.18 postgres
 9072 postgres 15  0 3602m 123m 118m R  5.9  0.1   0:05.51 postgres
 6855 postgres 15  0 3602m 1.8g 1.8g R  4.9  1.4   0:13.39 postgres
 9036 dushyant 15  0 13000 1388  816 R  1.0  0.0   0:00.33 top
   24 root     34 19     0    0    0 R  0.0  0.0   0:50.72 ksoftirqd/7

As you can see in the above list, out of 471 tasks, 15 are running and 456 are sleeping. This means that 15 tasks are in the TASK_RUNNING state and 456 are in the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. Note that the term "running" is a slight misnomer: the above snapshot was taken on a machine with 8 cores, so at any point in time at most 8 of the 15 tasks are actually on a CPU while the remaining 7 wait in the run queue. Tasks that are sleeping do not consume any CPU cycles. You can make top show only tasks in the TASK_RUNNING state by using the "i" toggle. Each task row shows the state of the task in the "S" column as one of: 'D' = uninterruptible sleep, 'R' = running, 'S' = sleeping (interruptible), 'T' = traced or stopped, 'Z' = zombie.

Using ps to check process states

  • ps -e r - show running tasks only
  • ps -eN r - show all tasks except running tasks

The state of the process in ps is displayed using the following flags

  • D - Uninterruptible sleep
  • R - Running or runnable (on run queue)
  • S - Interruptible sleep (waiting for an event to complete)
  • T - Stopped, either by a job control signal or because it is being traced
  • X - dead (should never be seen)
  • Z - Defunct ("zombie") process, terminated but not reaped by its parent.
  • < - high-priority (not nice to other users)
  • N - low-priority (nice to other users)
  • L - has pages locked into memory (for real-time and custom IO)
  • s - is a session leader
  • l - is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
  • + - is in the foreground process group
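
The flags listed above can be decoded mechanically. The sketch below is a small Python helper whose tables are copied from that list; decode_stat, STATES and MODIFIERS are names invented for this illustration.

```python
STATES = {
    "D": "uninterruptible sleep",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped or traced",
    "X": "dead",
    "Z": "zombie",
}

MODIFIERS = {
    "<": "high-priority",
    "N": "low-priority",
    "L": "has pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def decode_stat(stat):
    """Translate a ps state string such as 'Sl' into readable descriptions."""
    if not stat or stat[0] not in STATES:
        raise ValueError(f"unrecognized state string: {stat!r}")
    descriptions = [STATES[stat[0]]]          # the first letter is the state
    for flag in stat[1:]:                     # the rest are modifiers
        descriptions.append(MODIFIERS.get(flag, f"unknown flag {flag!r}"))
    return descriptions


if __name__ == "__main__":
    for stat in ("Sl", "S+", "RN"):
        print(stat, "->", ", ".join(decode_stat(stat)))
```

For instance, decode_stat("Sl") describes a task in interruptible sleep that is multi-threaded.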

Notice in the output below that the "l" flag is set in the process state of mysqld, signifying that it is multi-threaded:

[user@server ~]$ ps -e r -N | grep "mysql"
2561 ? S 0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --pid-file=/var/lib/mysql/sessions.myorderbox.com.pid
13970 pts/6 S+ 0:00 grep mysql
23638 ? Sl 1055:34 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/sessions.myorderbox.com.err --pid-file=/var/lib/mysql/sessions.myorderbox.com.pid --socket=/var/lib/mysql/mysql.sock --port=3306

/proc/<pid>/wchan

[user@server ~]$ cat /proc/14860/wchan
_stext

The wchan file in the /proc/<pid> directory gives the kernel function in which the process is waiting. However, wchan is broken on x86 systems where SCHED_NO_NO_OMIT_FRAME_POINTER has been set to "y" (which is the default value). On those systems the wchan value within /proc/<pid>/stat always reads "0", which maps to _stext. Refer to http://lkml.org/lkml/2008/11/6/12 and http://lwn.net/Articles/292178/

/proc/<pid>/status

$ cat /proc/$$/status
Name: bash
State: S (sleeping)
Tgid: 3515
Pid: 3515
PPid: 3452
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 100 100 100 100
FDSize: 256
Groups: 16 33 100
VmPeak: 9136 kB
VmSize: 7896 kB
VmLck: 0 kB
VmHWM: 7572 kB
VmRSS: 6316 kB
VmData: 5224 kB
VmStk: 88 kB
VmExe: 572 kB
VmLib: 1708 kB
VmPTE: 20 kB
Threads: 1
SigQ: 0/3067
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000010000
SigIgn: 0000000000384004
SigCgt: 000000004b813efb
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: 00000001
Cpus_allowed_list: 0
Mems_allowed: 1
Mems_allowed_list: 0
voluntary_ctxt_switches: 150
nonvoluntary_ctxt_switches: 545

The fields are as follows:

  • Name: Command run by this process.
  • State: Current state of the process. One of "R (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "T (tracing stop)", "Z (zombie)", or "X (dead)".
  • Tgid: Thread group ID (i.e., Process ID).
  • Pid: Thread ID (see gettid(2)).
  • TracerPid: PID of process tracing this process (0 if not being traced).
  • Uid, Gid: Real, effective, saved set, and file system UIDs (GIDs).
  • FDSize: Number of file descriptor slots currently allocated.
  • Groups: Supplementary group list.
  • VmPeak: Peak virtual memory size.
  • VmSize: Virtual memory size.
  • VmLck: Locked memory size (see mlock(2)).
  • VmHWM: Peak resident set size ("high water mark").
  • VmRSS: Resident set size.
  • VmData, VmStk, VmExe: Size of data, stack, and text segments.
  • VmLib: Shared library code size.
  • VmPTE: Page table entries size (since Linux 2.6.10).
  • Threads: Number of threads in process containing this thread.
  • SigPnd, ShdPnd: Masks (in hexadecimal) of signals pending for the thread and for the process as a whole (see pthreads(7) and signal(7)).
  • SigBlk, SigIgn, SigCgt: Masks indicating signals being blocked, ignored, and caught (see signal(7)).
  • CapInh, CapPrm, CapEff: Masks of capabilities enabled in inheritable, permitted, and effective sets (see capabilities(7)).
  • CapBnd: Capability Bounding set (since kernel 2.6.26, see capabilities(7)).
  • Cpus_allowed: Mask of CPUs on which this process may run (since Linux 2.6.24, see cpuset(7)).
  • Cpus_allowed_list: Same as previous, but in "list format" (since Linux 2.6.26, see cpuset(7)).
  • Mems_allowed: Mask of memory nodes allowed to this process (since Linux 2.6.24, see cpuset(7)).
  • Mems_allowed_list: Same as previous, but in "list format" (since Linux 2.6.26, see cpuset(7)).
  • voluntary_ctxt_switches, nonvoluntary_ctxt_switches: Number of voluntary and involuntary context switches (since Linux 2.6.23).

/proc/<pid>/stat

Status information about the process. This is used by ps(1). It is defined in /usr/src/linux/fs/proc/array.c.

[user@server ~]$ cat /proc/7278/stat
7278 (postgres) S 1 7257 7257 0 -1 4202496 36060376 10845160168 0 749 20435 137212 158536835 39143290 15 0 1 0 50528579 3763298304 20289 18446744073709551615 4194304 7336916 140734091375136 18446744073709551615 225773929891 0 0 19935232 84487 0 0 0 17 2 0 0 12

The fields, in order, are:

  • pid: The process ID.
  • comm: The filename of the executable, in parentheses. This is visible whether or not the executable is swapped out.
  • state: One character from the string "RSDZTW" where R is running, S is sleeping in an interruptible wait, D is waiting in uninterruptible disk sleep, Z is zombie, T is traced or stopped (on a signal), and W is paging.
  • ppid: The PID of the parent.
  • pgrp: The process group ID of the process.
  • session: The session ID of the process.
  • tty_nr: The controlling terminal of the process. (The minor device number is contained in the combination of bits 31 to 20 and 7 to 0; the major device number is in bits 15 to 8.)
  • tpgid: The ID of the foreground process group of the controlling terminal of the process.
  • flags: The kernel flags word of the process. For bit meanings, see the PF_* defines in <linux/sched.h>. Details depend on the kernel version.
  • minflt: The number of minor faults the process has made which have not required loading a memory page from disk.
  • cminflt: The number of minor faults that the process's waited-for children have made.
  • majflt: The number of major faults the process has made which have required loading a memory page from disk.
  • cmajflt: The number of major faults that the process's waited-for children have made.
  • utime: Amount of time that this process has been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This includes guest time, guest_time (time spent running a virtual CPU, see below), so that applications that are not aware of the guest time field do not lose that time from their calculations.
  • stime: Amount of time that this process has been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
  • cutime: Amount of time that this process's waited-for children have been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). (See also times(2).) This includes guest time, cguest_time (time spent running a virtual CPU, see below).
  • cstime: Amount of time that this process's waited-for children have been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
  • priority: (Explanation for Linux 2.6) For processes running a real-time scheduling policy (policy below; see sched_setscheduler(2)), this is the negated scheduling priority, minus one; that is, a number in the range -2 to -100, corresponding to real-time priorities 1 to 99. For processes running under a non-real-time scheduling policy, this is the raw nice value (setpriority(2)) as represented in the kernel. The kernel stores nice values as numbers in the range 0 (high) to 39 (low), corresponding to the user-visible nice range of -20 to 19. Before Linux 2.6, this was a scaled value based on the scheduler weighting given to this process.
  • nice: The nice value (see setpriority(2)), a value in the range 19 (low priority) to -20 (high priority).
  • num_threads: Number of threads in this process (since Linux 2.6). Before kernel 2.6, this field was hard coded to 0 as a placeholder for an earlier removed field.
  • itrealvalue: The time in jiffies before the next SIGALRM is sent to the process due to an interval timer. Since kernel 2.6.17, this field is no longer maintained, and is hard coded as 0.
  • starttime: The time in jiffies the process started after system boot.
  • vsize: Virtual memory size in bytes.
  • rss: Resident Set Size: number of pages the process has in real memory. This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out.
  • rsslim: Current soft limit in bytes on the rss of the process; see the description of RLIMIT_RSS in getrlimit(2).
  • startcode: The address above which program text can run.
  • endcode: The address below which program text can run.
  • startstack: The address of the start (i.e., bottom) of the stack.
  • kstkesp: The current value of ESP (stack pointer), as found in the kernel stack page for the process.
  • kstkeip: The current EIP (instruction pointer).
  • signal: The bitmap of pending signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
  • blocked: The bitmap of blocked signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
  • sigignore: The bitmap of ignored signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
  • sigcatch: The bitmap of caught signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
  • wchan: This is the "channel" in which the process is waiting. It is the address of a system call, and can be looked up in a namelist if you need a textual name. (If you have an up-to-date /etc/psdatabase, then try ps -l to see the WCHAN field in action.)
  • nswap: Number of pages swapped (not maintained).
  • cnswap: Cumulative nswap for child processes (not maintained).
  • exit_signal: (since Linux 2.1.22) Signal to be sent to parent when we die.
  • processor: (since Linux 2.2.8) CPU number last executed on.
  • rt_priority: (since Linux 2.5.19) Real-time scheduling priority, a number in the range 1 to 99 for processes scheduled under a real-time policy, or 0, for non-real-time processes (see sched_setscheduler(2)).
  • policy: (since Linux 2.5.19) Scheduling policy (see sched_setscheduler(2)). Decode using the SCHED_* constants in linux/sched.h.
  • delayacct_blkio_ticks: (since Linux 2.6.18) Aggregated block I/O delays, measured in clock ticks (centiseconds).
  • guest_time: (since Linux 2.6.24) Guest time of the process (time spent running a virtual CPU for a guest operating system), measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
  • cguest_time: (since Linux 2.6.24) Guest time of the process's children, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
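
The field order above can be turned into a small parser. The sketch below assumes Python 3; parse_stat, ticks_to_seconds, STAT_FIELDS and SAMPLE are names invented for this illustration, and SAMPLE is the postgres line from the example above. The command name may itself contain spaces or parentheses, so the line is split at the last closing parenthesis.

```python
import os

STAT_FIELDS = (
    "pid comm state ppid pgrp session tty_nr tpgid flags minflt cminflt "
    "majflt cmajflt utime stime cutime cstime priority nice num_threads "
    "itrealvalue starttime vsize rss rsslim startcode endcode startstack "
    "kstkesp kstkeip signal blocked sigignore sigcatch wchan nswap cnswap "
    "exit_signal processor rt_priority policy delayacct_blkio_ticks "
    "guest_time cguest_time"
).split()

SAMPLE = (
    "7278 (postgres) S 1 7257 7257 0 -1 4202496 36060376 10845160168 0 749 "
    "20435 137212 158536835 39143290 15 0 1 0 50528579 3763298304 20289 "
    "18446744073709551615 4194304 7336916 140734091375136 "
    "18446744073709551615 225773929891 0 0 19935232 84487 0 0 0 17 2 0 0 12"
)


def parse_stat(line):
    """Split one /proc/<pid>/stat line into a dict keyed by field name."""
    open_paren = line.index("(")
    close_paren = line.rindex(")")       # last ')' ends the command name
    values = [line[:open_paren].strip(), line[open_paren + 1:close_paren]]
    values += line[close_paren + 1:].split()
    # zip() stops at the shorter sequence, so older kernels that emit fewer
    # fields (or newer ones that emit more) are handled without errors.
    return {
        name: value if name in ("comm", "state") else int(value)
        for name, value in zip(STAT_FIELDS, values)
    }


def ticks_to_seconds(ticks):
    """Convert clock ticks to seconds using the system's tick rate."""
    return ticks / os.sysconf("SC_CLK_TCK")


if __name__ == "__main__":
    stat = parse_stat(SAMPLE)
    print(stat["comm"], stat["state"], stat["priority"], stat["nice"])
    print("user time (s):", ticks_to_seconds(stat["utime"]))
```

The sample line carries 42 fields, so guest_time and cguest_time are simply absent from the result, which is consistent with a kernel older than 2.6.24.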

vmstat

procs -----------memory------------ ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 3  1    696 1169444 442588 5495700    0    0    75   103    5    2  4  2 91  3  0
 2  0    696 1125064 442788 5511968    0    0  4392  3036 3365 2532  7  6 79  8  0
 1  2    696 1121304 442900 5515692    0    0  2656   420 2585 2754  3  6 85  6  0
 4  3    696 1081844 443292 5533832    0    0  2036 10042 4874 4655 13  8 69 10  0

  • r: The number of processes waiting for run time
  • b: The number of processes in uninterruptible sleep
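
Similar counters are exposed by the kernel in /proc/stat as the procs_running and procs_blocked lines. The sketch below parses them with Python 3; it assumes these two lines are the counterpart of the r and b columns, and the values vmstat prints may differ slightly (for example because the reading process is itself runnable). run_queue_counts is a hypothetical name and SAMPLE is made-up input, not output from a real machine.

```python
SAMPLE = """cpu  100 0 50 1000 10 0 5 0
ctxt 12345
procs_running 3
procs_blocked 1
"""


def run_queue_counts(proc_stat_text):
    """Extract the runnable and blocked task counts from /proc/stat content."""
    wanted = ("procs_running", "procs_blocked")
    counts = {}
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            counts[parts[0]] = int(parts[1])
    return counts


if __name__ == "__main__":
    print(run_queue_counts(SAMPLE))
    # On a Linux machine the live values can be read with:
    #     with open("/proc/stat") as f:
    #         print(run_queue_counts(f.read()))
```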

All content in the Directi Wiki is licensed under a Creative Commons Attribution-Share Alike 3.0 .