Merge pull request #94 from Seaworth/english

complete translation of common_knowledge/linuxProcess.md
labuladong
2020-03-01 19:27:24 +08:00
committed by GitHub
10 changed files with 123 additions and 120 deletions

@ -0,0 +1,123 @@
# What are Process, Thread, and File Descriptor in Linux?
**Translator: [Seaworth](https://github.com/Seaworth)**
**Author: [labuladong](https://github.com/labuladong)**
When it comes to processes, probably the most common interview question is the relationship between threads and processes. The short answer: **in Linux, there is almost no difference between a process and a thread**.
In Linux, a process is just a data structure. Once you understand that structure, you can see how file descriptors, redirection, and pipes work under the hood. Finally, from the operating system's perspective, we will see why there is basically no difference between a thread and a process.
### 1. What is a process?
First, at an abstract level, a computer looks like this:
![](../pictures/linuxProcess/1.jpg)
The large rectangle represents the computer's **memory space**, the small rectangles inside it represent **processes**, the circle in the lower left corner represents the **disk**, and the shapes in the lower right corner represent **input and output devices** such as the mouse, keyboard, and monitor. Also note that the memory space is divided into two parts: the upper part is **user space** and the lower part is **kernel space**.
User space holds the resources that user processes need. For example, if you create an array in your program, that array lives in user space. Kernel space holds the resources that the kernel itself needs, and user programs are generally not allowed to access them directly. Some resources are shared between user processes, though: a dynamic link library, for example, can be mapped into the address space of every process that uses it.
Suppose we write a hello program in C, compile it into an executable file, and run it on the command line. It prints Hello World on the screen and then exits. At the operating system level, a new process is created, the executable file is read into memory, the process executes it, and finally the process exits.
**The executable program you compiled is just a file**, not a process. The executable file must be loaded into memory and wrapped in a process before it can really run. Processes are created by the operating system, and each process has its own attributes, such as a process ID (PID), a process state, and a set of open files. After the process is created, it loads your program, and only then is your program executed by the system.
So how does the operating system create processes? **For the operating system, a process is a data structure**. Let's look directly at the Linux source code:
```cpp
struct task_struct {
    // Process status
    /* -1 unrunnable, 0 runnable, >0 stopped: */
    long state;
    // Virtual memory structure
    struct mm_struct *mm;
    // Process number
    pid_t pid;
    // Pointer to parent process
    struct task_struct __rcu *parent;
    // Children form the list of natural children:
    struct list_head children;
    // Pointer to filesystem information:
    struct fs_struct *fs;
    // Open file information:
    struct files_struct *files;
};
```
`task_struct` is how the Linux kernel describes a process, so it is also called the `process descriptor`. The real [source code](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) is much more complicated, so I have only excerpted a few common fields here.
The interesting ones are the `mm` pointer and the `files` pointer. The `mm` pointer refers to the virtual memory of the process, which is where its resources and the executable file are loaded. The `files` pointer refers to a structure that holds an array of pointers to all the files opened by the process.
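A user program cannot read `task_struct` directly, but some of its fields are visible through system calls. The following is a minimal sketch, assuming a POSIX system; the function name `child_reports_parent` is made up for illustration. It checks that the parent recorded for a newly created process is the process that created it.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child sees the caller as its parent, 0 if it does
 * not, and -1 on error. */
int child_reports_parent(void)
{
    pid_t parent_pid = getpid();   /* the caller's process ID */
    pid_t child = fork();          /* the kernel sets up a new process */
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: getppid() reports the parent recorded by the kernel. */
        _exit(getppid() == parent_pid ? 0 : 1);
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling `child_reports_parent()` from your own `main` should normally return 1.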
### 2. What is a file descriptor?
Let's start with `files`, which we can think of as an array of file pointers. Generally, a process reads input from `files[0]`, writes output to `files[1]`, and writes error information to `files[2]`.
For example, from our perspective the `printf` function in C prints characters to the command line, but from the process's perspective it writes data to `files[1]`. Similarly, `scanf` means the process reads data from `files[0]`.
**When a process is created, the first three slots of `files` are filled with default values that point to the standard input stream, the standard output stream, and the standard error stream, respectively. What we usually call a `file descriptor` is simply an index into this array of file pointers**. So by default a program has these file descriptors: 0 is standard input (stdin), 1 is standard output (stdout), and 2 is standard error (stderr).
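To make the mapping concrete, here is a small sketch in C, assuming a POSIX system. The helper names `write_to_fd` and `stdio_matches_descriptors` are invented for this example. It shows that the familiar `stdin`, `stdout`, and `stderr` streams sit on top of descriptors 0, 1, and 2, and that you can write to a descriptor directly.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Writes a message straight to a file descriptor, bypassing stdio
 * buffering. Returns the number of bytes written, or -1 on error. */
ssize_t write_to_fd(int fd, const char *msg)
{
    return write(fd, msg, strlen(msg));
}

/* Returns 1 if the stdio streams use descriptors 0, 1 and 2. */
int stdio_matches_descriptors(void)
{
    return fileno(stdin) == STDIN_FILENO      /* 0 */
        && fileno(stdout) == STDOUT_FILENO    /* 1 */
        && fileno(stderr) == STDERR_FILENO;   /* 2 */
}
```

For instance, `write_to_fd(STDERR_FILENO, "oops\n")` sends the text to whatever `files[2]` currently points to.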
We can redraw a picture as follows:
![](../pictures/linuxProcess/2.jpg)
On a typical computer, the input stream is the keyboard, and the output stream and error stream are both the display. So this process is now connected to the kernel by three wires. Because hardware resources are managed by the kernel, a process has to use **system calls** to ask the kernel to access hardware on its behalf.
PS: Don't forget that everything is abstracted as a file in Linux. Devices are files too, and they can be read from and written to.
If our program needs other resources, such as opening a file for reading and writing, that is also simple. It makes a system call, the kernel opens the file, and the file is placed in the 4th position of `files`, that is, `files[3]`:
![](../pictures/linuxProcess/3.jpg)
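The new file lands in the 4th position because `open` hands back the lowest-numbered free slot, and slots 0 to 2 are normally taken. The sketch below, assuming a POSIX system and using an invented helper name `open_reuses_lowest_slot`, shows this by opening a file, closing it, and opening it again.

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens `path` read-only twice, closing it in between, and reports whether
 * both opens returned the same descriptor number. Returns 1 if so, 0 if
 * not, and -1 if the file could not be opened. */
int open_reuses_lowest_slot(const char *path)
{
    int first = open(path, O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                 /* slot `first` becomes free again */

    int second = open(path, O_RDONLY);
    if (second < 0)
        return -1;
    close(second);
    return first == second;       /* the freed slot is handed out again */
}
```

In a single-threaded program where only descriptors 0, 1, and 2 are open, both calls would be expected to return 3.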
Once you understand this principle, **input redirection** is easy to understand. When the program wants to read data, it reads from `files[0]`. So we just point `files[0]` at a file, and the program will read its data from that file instead of the keyboard. The **less-than character <** is used to redirect the input of a command.
```shell
$ command < file.txt
```
![](../pictures/linuxProcess/5.jpg)
Similarly, **output redirection** points `files[1]` at a file, so the output of the program is written to that file instead of the display. The **greater-than character >** is used for output redirection.
```shell
$ command > file.txt
```
![](../pictures/linuxProcess/4.jpg)
Error redirection works the same way on `files[2]` (written `2>` in the shell), so I will not go into details.
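Here is roughly what a shell might do for `command > file.txt`, as a sketch rather than the code of any real shell. It assumes a POSIX system, and `run_redirected` is a hypothetical helper in which the "command" is just a `write` to descriptor 1. The key step is `dup2`, which makes slot 1 refer to the file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs a child whose descriptor 1 points at `path`, mimicking
 * `command > path`. Returns 0 on success, -1 on failure. */
int run_redirected(const char *path, const char *msg)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            _exit(1);
        if (fd != STDOUT_FILENO) {
            if (dup2(fd, STDOUT_FILENO) < 0)  /* slot 1 now refers to the file */
                _exit(1);
            close(fd);                        /* the extra slot is not needed */
        }
        /* The "command" writes to descriptor 1 as usual. */
        ssize_t n = write(STDOUT_FILENO, msg, strlen(msg));
        _exit(n == (ssize_t)strlen(msg) ? 0 : 1);
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The child never needs to know that its output goes to a file. It keeps writing to descriptor 1, and only the target of that slot has changed.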
The **pipe symbol** works on the same idea. It connects the output stream of one process to the input stream of another with a "pipe", and data flows through it. I have to say that this design is really beautiful.
```shell
$ cmd1 | cmd2 | cmd3
```
![](../pictures/linuxProcess/6.jpg)
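A pipe between two processes can be sketched in a few lines of C. This is an illustration assuming a POSIX system, not how any particular shell is implemented, and `pipe_from_child` is an invented name. The child plays the role of `cmd1` with its standard output connected to the pipe, and the parent plays `cmd2`, reading from the other end.

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sends `msg` from a child process to the parent through a pipe. Returns
 * the number of bytes the parent received, or -1 on error. */
ssize_t pipe_from_child(const char *msg, char *out, size_t out_size)
{
    int fds[2];                   /* fds[0]: read end, fds[1]: write end */
    if (out_size == 0 || pipe(fds) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        /* Plays cmd1: its standard output becomes the pipe's write end. */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            dup2(fds[1], STDOUT_FILENO);
            close(fds[1]);
        }
        ssize_t n = write(STDOUT_FILENO, msg, strlen(msg));
        _exit(n < 0 ? 1 : 0);
    }

    /* Plays cmd2: read until the writer closes its end of the pipe. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < out_size - 1 &&
           (n = read(fds[0], out + total, out_size - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(fds[0]);
    waitpid(child, NULL, 0);
    return (ssize_t)total;
}
```

Closing the unused ends matters: the reader only sees end-of-file once every copy of the write end has been closed.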
At this point you can probably see how clever the **everything is a file** design of Linux is. Whether it is a device, another process, a socket, or a real file, all of them can be read and written, and all of them sit in one simple `files` array. A process reaches each resource through a plain file descriptor, and the details are left to the operating system. The result is effectively decoupled, elegant, and efficient.
### 3. What is a thread?
The first thing to be clear about is that both multi-process and multi-thread programs achieve concurrency and can improve processor utilization. So the key question is what the difference between multi-threading and multi-processing is.
Why is there basically no difference between a thread and a process in Linux? Because from the Linux kernel's perspective, threads and processes are not treated differently.
We know that the system call `fork()` creates a new child process, and that the function `pthread_create()` creates a new thread. **But both threads and processes are represented by the `task_struct` structure. The only difference is which data areas are shared**.
In other words, a thread looks no different from a process. It is just that some data areas of a thread are shared with its parent process, whereas a child process gets a copy instead of sharing. For example, the `mm` structure and the `files` structure are shared across threads. I drew two pictures, and you will understand:
![](../pictures/linuxProcess/7.jpg)
![](../pictures/linuxProcess/8.jpg)
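The difference in the two pictures can be observed from user space. The sketch below assumes a POSIX system with pthreads (compile with `-pthread` where your toolchain requires it), and the names `value_after_fork` and `value_after_thread` are invented for illustration. A child process changes only its own copy of a variable, while a thread changes the variable its creator sees.

```c
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 0;

static void *set_value_in_thread(void *arg)
{
    (void)arg;
    shared_value = 42;            /* same address space as the creator */
    return NULL;
}

/* Returns the parent's view of shared_value after a child process set it,
 * or -1 on error. */
int value_after_fork(void)
{
    shared_value = 0;
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        shared_value = 42;        /* modifies the child's private copy only */
        _exit(0);
    }
    waitpid(child, NULL, 0);
    return shared_value;          /* unchanged in the parent */
}

/* Returns the creator's view of shared_value after a thread set it,
 * or -1 on error. */
int value_after_thread(void)
{
    shared_value = 0;
    pthread_t tid;
    if (pthread_create(&tid, NULL, set_value_in_thread, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return shared_value;          /* the thread's write is visible here */
}
```

Under these assumptions `value_after_fork()` returns 0 and `value_after_thread()` returns 42.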
Therefore, a multi-threaded program should use locks to prevent multiple threads from writing to the same area at the same time. Otherwise, the data may be corrupted.
Then you may ask: **since processes and threads are so similar, and processes do not share data and therefore have no data corruption problem, why is multi-threading used more often than multi-processing?**
Because in practice, concurrency over shared data is far more common. For example, ten people each withdraw ten yuan from the same account at the same time. What we want is for the balance of this shared account to drop by exactly one hundred yuan. We do not want each person to get a copy of the account and each copy to drop by ten yuan.
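The account example can be written as a short pthreads sketch. It assumes a POSIX system with pthreads (compile with `-pthread` where needed), and the names `withdraw_ten` and `withdraw_concurrently` are made up for this illustration. Ten threads share one balance, and a mutex makes each withdrawal happen one at a time.

```c
#include <pthread.h>

#define WITHDRAWERS 10

static long balance;
static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

static void *withdraw_ten(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&balance_lock);    /* one thread updates at a time */
    balance -= 10;
    pthread_mutex_unlock(&balance_lock);
    return NULL;
}

/* Ten threads each withdraw 10 from the same account. Returns the final
 * balance, or -1 if a thread could not be created. */
long withdraw_concurrently(long opening_balance)
{
    pthread_t threads[WITHDRAWERS];
    int created = 0;

    balance = opening_balance;
    while (created < WITHDRAWERS &&
           pthread_create(&threads[created], NULL, withdraw_ten, NULL) == 0)
        created++;

    /* Wait for every thread that was actually started. */
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);

    return created == WITHDRAWERS ? balance : -1;
}
```

Starting from 1000, `withdraw_concurrently(1000)` leaves 900 in the account, because all ten threads update the same `balance` rather than private copies.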
Of course, it must be said that it is Linux that treats a thread as a process that shares data, with no special handling. Many other operating systems treat threads and processes differently and give threads their own data structures. I personally think that design is less concise than Linux's and adds complexity to the system.
Creating threads and processes is very efficient in Linux. To avoid copying the memory area when a process is created, Linux uses a copy-on-write strategy: the parent's memory space is not actually copied up front, and a region is copied only when a write happens. **So creating processes and creating threads are both very fast in Linux**.
We stick to original, high-quality articles and are committed to explaining algorithm problems clearly. Follow the WeChat official account **labuladong** for the latest articles.

@ -1,120 +0,0 @@
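# What are Process, Thread, and File Descriptor in Linux
When it comes to processes, probably the most common interview question is the relationship between threads and processes, so here is the answer up front: **in Linux, there is almost no difference between a process and a thread**.
A process in Linux is just a data structure. Once you understand it, you can understand how file descriptors, redirection, and pipe commands work underneath. Finally, we look from the operating system's perspective at why threads and processes are basically the same.
### 1. What is a process
First, abstractly speaking, our computer is this thing:
![](../pictures/linuxProcess/1.jpg)
The large rectangle represents the computer's **memory space**, the small rectangles inside it represent **processes**, the circle in the lower left corner represents the **disk**, and the shapes in the lower right corner represent **input and output devices** such as the mouse, keyboard, and monitor. Also note that the memory space is divided into two parts: the upper half is **user space** and the lower half is **kernel space**.
User space holds the resources that user processes need. For example, an array you create in your program lives in user space. Kernel space holds the system resources that the kernel needs, and users are generally not allowed to access them. Note, though, that some resources are shared between user processes, such as dynamic link libraries.
We write a hello program in C, compile it into an executable file, and run it on the command line. It prints hello world and then exits. At the operating system level, a new process is created, the compiled executable file is read into memory, executed, and finally the process exits.
**The executable program you compiled is just a file**, not a process. The executable file must be loaded into memory and wrapped in a process before it can really run. Processes are created by the operating system, and each process has its own attributes, such as a process ID (PID), a process state, and open files. Only after the process has been created and has loaded your program is your program executed by the system.
So how does the operating system create processes? **For the operating system, a process is a data structure**. Let's look directly at the Linux source code: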
```cpp
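struct task_struct {
    // Process status
    long state;
    // Virtual memory structure
    struct mm_struct *mm;
    // Process number
    pid_t pid;
    // Pointer to the parent process
    struct task_struct __rcu *parent;
    // List of child processes
    struct list_head children;
    // Pointer to filesystem information
    struct fs_struct *fs;
    // An array holding pointers to the files opened by this process
    struct files_struct *files;
};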
```
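`task_struct` is how the Linux kernel describes a process, so it can also be called the "process descriptor". The source code is rather complicated, so I have excerpted only a few of the more common fields here.
The interesting ones are the `mm` pointer and the `files` pointer. `mm` points to the virtual memory of the process, which is where resources and the executable file are loaded. `files` points to an array that holds pointers to all the files opened by the process.
### 2. What is a file descriptor
Let's start with `files`, which is an array of file pointers. Generally, a process reads input from `files[0]`, writes output to `files[1]`, and writes error information to `files[2]`.
For example, from our perspective the `printf` function in C prints characters to the command line, but from the process's perspective it writes data to `files[1]`. Likewise, `scanf` means the process tries to read data from the file `files[0]`.
**When a process is created, the first three slots of `files` are filled with default values that point to the standard input stream, the standard output stream, and the standard error stream. What we usually call a "file descriptor" is an index into this array of file pointers**. So by default, file descriptor 0 is input, 1 is output, and 2 is error.
We can redraw the picture:
![](../pictures/linuxProcess/2.jpg)
On a typical computer, the input stream is the keyboard, the output stream is the display, and the error stream is also the display, so the process is now connected to the kernel by three wires. Because hardware is managed by the kernel, our process has to use "system calls" to have the kernel access hardware resources.
PS: Don't forget that everything is abstracted as a file in Linux. Devices are files too, and they can be read and written.
If our program needs other resources, such as opening a file for reading and writing, that is also simple. It makes a system call, the kernel opens the file, and the file is placed in the 4th position of `files`:
![](../pictures/linuxProcess/3.jpg)
Once you understand this principle, **input redirection** is easy to understand. When the program wants to read data it reads from `files[0]`, so we only need to point `files[0]` at a file, and the program will read from that file instead of the keyboard: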
```shell
$ command < file.txt
```
![](../pictures/linuxProcess/5.jpg)
Similarly, **output redirection** points `files[1]` at a file, so the program's output is written to that file instead of the display:
```shell
$ command > file.txt
```
![](../pictures/linuxProcess/4.jpg)
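Error redirection works the same way, so I will not go into details.
The **pipe symbol** follows the same idea. It connects the output stream of one process to the input stream of another with a "pipe", and data flows through it. I have to say that this design is really beautiful: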
```shell
$ cmd1 | cmd2 | cmd3
```
![](../pictures/linuxProcess/6.jpg)
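At this point you can probably see how clever the "everything is a file" design of Linux is. Whether it is a device, another process, a socket, or a real file, all of them can be read and written, and all of them sit in one simple `files` array. A process reaches each resource through a plain file descriptor, and the details are left to the operating system. The result is effectively decoupled, elegant, and efficient.
### 3. What is a thread
The first thing to be clear about is that multi-processing and multi-threading are both forms of concurrency and both can improve processor utilization. So the key question now is how they differ.
Why do we say there is basically no difference between threads and processes in Linux? Because from the Linux kernel's perspective, threads and processes are not treated differently.
We know that the system call `fork()` creates a new child process and that the function `pthread_create()` creates a new thread. **But both threads and processes are represented by the `task_struct` structure. The only difference is which data areas are shared**.
In other words, a thread looks no different from a process. It is just that some data areas of a thread are shared with its parent process, whereas a child process gets a copy rather than sharing. For example, the `mm` structure and the `files` structure are shared among threads. I drew two pictures, and you will understand:
![](../pictures/linuxProcess/7.jpg)
![](../pictures/linuxProcess/8.jpg)
So a multi-threaded program has to use locks to keep multiple threads from writing to the same area at the same time. Otherwise the data may be corrupted.
Then you may ask: **since processes and threads are about the same, and processes do not share data and so have no data corruption problem, why is multi-threading used far more often than multi-processing?**
Because in practice, concurrency over shared data is more common. For example, ten people each withdraw ten yuan from one account at the same time. What we want is for the balance of this shared account to drop by exactly one hundred yuan, not for each person to get a copy of the account with each copy dropping by ten yuan.
Of course, it must be said that only Linux treats a thread as a process that shares data, with no special handling. Many other operating systems treat threads and processes differently and give threads their own data structures. I personally think that is less concise than the Linux design and adds complexity to the system.
Creating threads and processes is very efficient in Linux. To deal with copying the memory area when a process is created, Linux uses a copy-on-write strategy: it does not actually copy the parent's memory space, and copies only when a write is needed. **So creating processes and creating threads are both very fast in Linux**.
We stick to original, high-quality articles and are committed to explaining algorithm problems clearly. Follow my official account labuladong for the latest articles: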
![labuladong](../pictures/labuladong.jpg)
