mirror of
https://github.com/krahets/hello-algo.git
synced 2025-07-29 21:33:07 +08:00
build
@ -2,7 +2,7 @@
|
||||
comments: true
|
||||
---
|
||||
|
||||
# 3.2 Basic Data Types
|
||||
# 3.2 Basic data types
|
||||
|
||||
When discussing data in computers, various forms such as text, images, videos, voice, and 3D models come to mind. Despite their different organizational forms, they are all composed of various basic data types.
|
||||
|
||||
@ -22,7 +22,7 @@ The range of values for basic data types depends on the size of the space they o
|
||||
|
||||
The following table lists the space occupied, value range, and default values of various basic data types in Java. While memorizing this table isn't necessary, having a general understanding of it and referencing it when required is recommended.
|
||||
|
||||
<p align="center"> Table 3-1 Space Occupied and Value Range of Basic Data Types </p>
|
||||
<p align="center"> Table 3-1 Space occupied and value range of basic data types </p>
|
||||
|
||||
<div class="center-table" markdown>
|
||||
|
||||
|
@ -2,29 +2,29 @@
|
||||
comments: true
|
||||
---
|
||||
|
||||
# 3.4 Character Encoding *
|
||||
# 3.4 Character encoding *
|
||||
|
||||
In computer systems, all data is stored in binary form, and characters (represented by `char`) are no exception. To represent characters, we need a "character set" that defines a one-to-one mapping between characters and binary numbers. With a character set, computers can convert binary numbers to characters by looking them up in the table.
|
||||
|
||||
## 3.4.1 ASCII Character Set
|
||||
## 3.4.1 ASCII character set
|
||||
|
||||
The "ASCII code" is one of the earliest character sets, officially known as the American Standard Code for Information Interchange. It uses 7 binary digits (the lower 7 bits of a byte) to represent a character, allowing for a maximum of 128 different characters. As shown in Figure 3-6, ASCII includes uppercase and lowercase English letters, the digits 0 to 9, various punctuation marks, and certain control characters (such as newline and tab).
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-6 ASCII Code </p>
|
||||
<p align="center"> Figure 3-6 ASCII code </p>
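As a minimal sketch of the table lookup described above, Python's built-in `ord` and `chr` convert between a character and its code. The helper name `ascii_code` is hypothetical and only used for illustration.

```python
def ascii_code(ch: str) -> int:
    """Return the ASCII code of a single character.

    Raises ValueError when the character is outside the 7-bit ASCII range.
    """
    code = ord(ch)  # ord() returns the character's numeric code
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code


print(ascii_code("A"))                 # 65
print(format(ascii_code("A"), "07b"))  # 1000001 (fits in 7 bits)
print(ascii_code("\n"))                # 10 (newline is a control character)
print(chr(97))                         # a (reverse lookup: code -> character)
```

Every ASCII code fits in 7 bits, which is why the highest bit of an ASCII byte is always 0.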
|
||||
|
||||
However, **ASCII can only represent English characters**. With the globalization of computers, a character set called "EASCII" was developed to represent more languages. It expands from the 7-bit structure of ASCII to 8 bits, enabling the representation of 256 characters.
|
||||
|
||||
Globally, various region-specific EASCII character sets have been introduced. The first 128 characters of these sets are consistent with ASCII, while the remaining 128 characters are defined differently to accommodate the requirements of different languages.
|
||||
|
||||
## 3.4.2 GBK Character Set
|
||||
## 3.4.2 GBK character set
|
||||
|
||||
Later, it was found that **EASCII still could not meet the character requirements of many languages**. For instance, there are nearly a hundred thousand Chinese characters, with several thousand used regularly. In 1980, the Standardization Administration of China released the "GB2312" character set, which included 6763 Chinese characters, essentially fulfilling the computer processing needs for the Chinese language.
|
||||
|
||||
However, GB2312 could not handle some rare and traditional characters. The "GBK" character set expands GB2312 and includes 21886 Chinese characters. In the GBK encoding scheme, ASCII characters are represented with one byte, while Chinese characters use two bytes.
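The one-byte versus two-byte split can be observed with Python's standard `gbk` codec. This is only an illustrative sketch; the helper name `gbk_byte_lengths` is an assumption made for this example.

```python
def gbk_byte_lengths(text: str) -> list:
    """Return the number of bytes each character occupies under GBK."""
    return [len(ch.encode("gbk")) for ch in text]


# ASCII characters take 1 byte each, Chinese characters take 2 bytes each
print(gbk_byte_lengths("Hello 算法"))  # [1, 1, 1, 1, 1, 1, 2, 2]
```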
|
||||
|
||||
## 3.4.3 Unicode Character Set
|
||||
## 3.4.3 Unicode character set
|
||||
|
||||
With the rapid evolution of computer technology and a plethora of character sets and encoding standards, numerous problems arose. On the one hand, these character sets generally only defined characters for specific languages and could not function properly in multilingual environments. On the other hand, the existence of multiple character set standards for the same language caused garbled text when information was exchanged between computers using different encoding standards.
|
||||
|
||||
@ -38,13 +38,13 @@ Unicode is a universal character set that assigns a number (called a "code point
|
||||
|
||||
A straightforward solution to this problem is to store all characters as equal-length encodings. As shown in Figure 3-7, each character in "Hello" occupies 1 byte, while each character in "算法" (algorithm) occupies 2 bytes. We could encode every character in "Hello 算法" with 2 bytes by padding the higher bits with zeros. This method would enable the system to interpret a character every 2 bytes, recovering the content of the phrase.
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-7 Unicode Encoding Example </p>
|
||||
<p align="center"> Figure 3-7 Unicode encoding example </p>
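The equal-length idea can be sketched as follows. The functions `fixed_two_byte_encode` and `fixed_two_byte_decode` are hypothetical helpers written for this illustration, and they assume every character's code point fits in 2 bytes.

```python
def fixed_two_byte_encode(text: str) -> bytes:
    """Encode every character as exactly 2 bytes (big-endian code point)."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError(f"{ch!r} does not fit in 2 bytes")
        out += code_point.to_bytes(2, "big")  # high bits are zero-padded
    return bytes(out)


def fixed_two_byte_decode(data: bytes) -> str:
    """Read one character every 2 bytes."""
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big")) for i in range(0, len(data), 2)
    )


encoded = fixed_two_byte_encode("Hello 算法")
print(len(encoded))                    # 16 bytes for 8 characters
print(encoded[:2])                     # b'\x00H' ('H' padded with a zero byte)
print(fixed_two_byte_decode(encoded))  # Hello 算法
```

The zero byte in front of every English letter is the wasted space that motivates a variable-length encoding.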
|
||||
|
||||
However, as ASCII has shown us, encoding English only requires 1 byte. Using the above approach would double the space occupied by English text compared to ASCII encoding, which is a waste of memory space. Therefore, a more efficient Unicode encoding method is needed.
|
||||
|
||||
## 3.4.4 UTF-8 Encoding
|
||||
## 3.4.4 UTF-8 encoding
|
||||
|
||||
Currently, UTF-8 has become the most widely used Unicode encoding method internationally. **It is a variable-length encoding**, using 1 to 4 bytes to represent a character, depending on the character's code point. ASCII characters need only 1 byte, non-ASCII Latin letters and Greek letters require 2 bytes, commonly used Chinese characters need 3 bytes, and some other rare characters need 4 bytes.
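A quick way to see the variable length is to encode one sample character from each group with Python's built-in `str.encode`. The sample characters and the helper name `utf8_length` are chosen only for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


print(utf8_length("A"))   # 1 (ASCII)
print(utf8_length("é"))   # 2 (non-ASCII Latin letter)
print(utf8_length("α"))   # 2 (Greek letter)
print(utf8_length("算"))  # 3 (common Chinese character)
print(utf8_length("😀"))  # 4 (character outside the 2-byte code point range)
```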
|
||||
|
||||
@ -59,26 +59,26 @@ But why set the highest 2 bits of the remaining bytes to $10$? Actually, this $1
|
||||
|
||||
The reason for using $10$ as a checksum is that, under UTF-8 encoding rules, it's impossible for the highest two bits of a character's first byte to be $10$. This can be proven by contradiction: if the highest two bits of the first byte were $10$, the character's length would be $1$, corresponding to ASCII. However, the highest bit of an ASCII character must be $0$, which contradicts the assumption.
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-8 UTF-8 Encoding Example </p>
|
||||
<p align="center"> Figure 3-8 UTF-8 encoding example </p>
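The role of the leading bits can be checked on real data. The sketch below classifies each byte of a UTF-8 string by its highest bits; `classify_utf8_byte` is a hypothetical helper, and the labels it returns are made up for this example.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one UTF-8 byte by its leading bits."""
    if b >> 7 == 0b0:
        return "single"        # 0xxxxxxx: 1-byte (ASCII) character
    if b >> 6 == 0b10:
        return "continuation"  # 10xxxxxx: never the first byte of a character
    if b >> 5 == 0b110:
        return "lead-2"        # 110xxxxx: first byte of a 2-byte character
    if b >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx: first byte of a 3-byte character
    if b >> 3 == 0b11110:
        return "lead-4"        # 11110xxx: first byte of a 4-byte character
    return "invalid"


data = "算".encode("utf-8")
print([format(b, "08b") for b in data])       # ['11100111', '10101110', '10010111']
print([classify_utf8_byte(b) for b in data])  # ['lead-3', 'continuation', 'continuation']
```

Because no first byte ever starts with `10`, a decoder that lands on a `10` byte knows it is in the middle of a character.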
|
||||
|
||||
Apart from UTF-8, other common encoding methods include:
|
||||
|
||||
- **UTF-16 Encoding**: Uses 2 or 4 bytes to represent a character. All ASCII characters and commonly used non-English characters are represented with 2 bytes; a few characters require 4 bytes. For 2-byte characters, the UTF-16 encoding equals the Unicode code point.
|
||||
- **UTF-32 Encoding**: Every character uses 4 bytes. This means UTF-32 occupies more space than UTF-8 and UTF-16, especially for texts with a high proportion of ASCII characters.
|
||||
- **UTF-16 encoding**: Uses 2 or 4 bytes to represent a character. All ASCII characters and commonly used non-English characters are represented with 2 bytes; a few characters require 4 bytes. For 2-byte characters, the UTF-16 encoding equals the Unicode code point.
|
||||
- **UTF-32 encoding**: Every character uses 4 bytes. This means UTF-32 occupies more space than UTF-8 and UTF-16, especially for texts with a high proportion of ASCII characters.
|
||||
|
||||
From the perspective of storage space, using UTF-8 to represent English characters is very efficient because it only requires 1 byte; using UTF-16 to encode some non-English characters (such as Chinese) can be more efficient because it only requires 2 bytes, while UTF-8 might need 3 bytes.
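The storage comparison can be reproduced with Python's standard codecs. The little-endian codec names are used here only so that no byte order mark is added to the output; the helper `encoded_sizes` is an illustrative assumption.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text under UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # "-le" avoids the byte order mark
        "utf-32": len(text.encode("utf-32-le")),
    }


print(encoded_sizes("Hello"))  # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(encoded_sizes("算法"))    # {'utf-8': 6, 'utf-16': 4, 'utf-32': 8}
```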
|
||||
|
||||
From a compatibility perspective, UTF-8 is the most versatile, with many tools and libraries supporting UTF-8 as a priority.
|
||||
|
||||
## 3.4.5 Character Encoding in Programming Languages
|
||||
## 3.4.5 Character encoding in programming languages
|
||||
|
||||
Historically, many programming languages have used fixed-length encodings such as UTF-16 or UTF-32 to process strings during program execution (UTF-16 is fixed-length only for characters that fit in 2 bytes). This allows strings to be handled as arrays, which offers several advantages:
|
||||
|
||||
- **Random Access**: Strings encoded in UTF-16 can be accessed randomly with ease. For UTF-8, which is a variable-length encoding, locating the $i^{th}$ character requires traversing the string from the start to the $i^{th}$ position, taking $O(n)$ time.
|
||||
- **Character Counting**: Similar to random access, counting the number of characters in a UTF-16 encoded string is an $O(1)$ operation. However, counting characters in a UTF-8 encoded string requires traversing the entire string.
|
||||
- **String Operations**: Many string operations like splitting, concatenating, inserting, and deleting are easier on UTF-16 encoded strings. These operations generally require additional computation on UTF-8 encoded strings to ensure the validity of the UTF-8 encoding.
|
||||
- **Random access**: Strings encoded in UTF-16 can be accessed randomly with ease. For UTF-8, which is a variable-length encoding, locating the $i^{th}$ character requires traversing the string from the start to the $i^{th}$ position, taking $O(n)$ time.
|
||||
- **Character counting**: Similar to random access, counting the number of characters in a UTF-16 encoded string is an $O(1)$ operation. However, counting characters in a UTF-8 encoded string requires traversing the entire string.
|
||||
- **String operations**: Many string operations like splitting, concatenating, inserting, and deleting are easier on UTF-16 encoded strings. These operations generally require additional computation on UTF-8 encoded strings to ensure the validity of the UTF-8 encoding.
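To make the random-access cost concrete, here is a minimal sketch of locating the $i^{th}$ character in UTF-8 bytes by scanning from the start. The function `utf8_char_at` is hypothetical and assumes the input is valid UTF-8.

```python
def utf8_char_at(data: bytes, i: int) -> str:
    """Return the i-th character (0-based) of UTF-8 data by linear scan."""
    pos = 0    # byte offset of the current character
    index = 0  # index of the current character
    while pos < len(data):
        first = data[pos]
        # The leading bits of the first byte give the character's length
        if first >> 7 == 0:
            length = 1
        elif first >> 5 == 0b110:
            length = 2
        elif first >> 4 == 0b1110:
            length = 3
        elif first >> 3 == 0b11110:
            length = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if index == i:
            return data[pos:pos + length].decode("utf-8")
        pos += length
        index += 1
    raise IndexError("character index out of range")


data = "Hello 算法".encode("utf-8")
print(utf8_char_at(data, 0))  # H
print(utf8_char_at(data, 6))  # 算
print(utf8_char_at(data, 7))  # 法
```

With a fixed-length encoding the same lookup is a single offset computation, `i * width`, which is the array-like behavior described in the list above.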
|
||||
|
||||
The design of character encoding schemes in programming languages is an interesting topic involving various factors:
|
||||
|
||||
|
@ -2,38 +2,38 @@
|
||||
comments: true
|
||||
---
|
||||
|
||||
# 3.1 Classification of Data Structures
|
||||
# 3.1 Classification of data structures
|
||||
|
||||
Common data structures include arrays, linked lists, stacks, queues, hash tables, trees, heaps, and graphs. They can be classified from two perspectives: "logical structure" and "physical structure".
|
||||
|
||||
## 3.1.1 Logical Structure: Linear and Non-Linear
|
||||
## 3.1.1 Logical structure: linear and non-linear
|
||||
|
||||
**The logical structures reveal the logical relationships between data elements**. In arrays and linked lists, data are arranged in a specific sequence, demonstrating the linear relationship between data; while in trees, data are arranged hierarchically from the top down, showing the derived relationship between "ancestors" and "descendants"; and graphs are composed of nodes and edges, reflecting the intricate network relationship.
|
||||
|
||||
As shown in Figure 3-1, logical structures can be divided into two major categories: "linear" and "non-linear". Linear structures are more intuitive: their data is arranged in a sequence according to its logical relationships. In non-linear structures, conversely, the data is not arranged in a sequence.
|
||||
|
||||
- **Linear Data Structures**: Arrays, Linked Lists, Stacks, Queues, Hash Tables.
|
||||
- **Non-Linear Data Structures**: Trees, Heaps, Graphs, Hash Tables.
|
||||
- **Linear data structures**: Arrays, Linked Lists, Stacks, Queues, Hash Tables.
|
||||
- **Non-linear data structures**: Trees, Heaps, Graphs, Hash Tables.
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-1 Linear and Non-Linear Data Structures </p>
|
||||
<p align="center"> Figure 3-1 Linear and non-linear data structures </p>
|
||||
|
||||
Non-linear data structures can be further divided into tree structures and network structures.
|
||||
|
||||
- **Linear Structures**: Arrays, linked lists, queues, stacks, and hash tables, where elements have a one-to-one sequential relationship.
|
||||
- **Tree Structures**: Trees, Heaps, Hash Tables, where elements have a one-to-many relationship.
|
||||
- **Network Structures**: Graphs, where elements have a many-to-many relationship.
|
||||
- **Linear structures**: Arrays, linked lists, queues, stacks, and hash tables, where elements have a one-to-one sequential relationship.
|
||||
- **Tree structures**: Trees, Heaps, Hash Tables, where elements have a one-to-many relationship.
|
||||
- **Network structures**: Graphs, where elements have a many-to-many relationship.
|
||||
|
||||
## 3.1.2 Physical Structure: Contiguous and Dispersed
|
||||
## 3.1.2 Physical structure: contiguous and dispersed
|
||||
|
||||
**During the execution of an algorithm, the data being processed is stored in memory**. Figure 3-2 shows a computer memory stick where each black square is a physical memory space. We can think of memory as a vast Excel spreadsheet, with each cell capable of storing a certain amount of data.
|
||||
|
||||
**The system accesses the data at the target location by means of a memory address**. As shown in Figure 3-2, the computer assigns a unique identifier to each cell in the table according to specific rules, ensuring that each memory space has a unique memory address. With these addresses, the program can access the data stored in memory.
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-2 Memory Stick, Memory Spaces, Memory Addresses </p>
|
||||
<p align="center"> Figure 3-2 Memory stick, memory spaces, memory addresses </p>
|
||||
|
||||
!!! tip
|
||||
|
||||
@ -43,9 +43,9 @@ Memory is a shared resource for all programs. When a block of memory is occupied
|
||||
|
||||
As illustrated in Figure 3-3, **the physical structure reflects the way data is stored in computer memory**, and it can be divided into contiguous space storage (arrays) and non-contiguous space storage (linked lists). The two types of physical structures exhibit complementary characteristics in terms of time efficiency and space efficiency.
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-3 Contiguous Space Storage and Dispersed Space Storage </p>
|
||||
<p align="center"> Figure 3-3 Contiguous space storage and dispersed space storage </p>
|
||||
|
||||
**It is worth noting that all data structures are implemented based on arrays, linked lists, or a combination of both**. For example, stacks and queues can be implemented using either arrays or linked lists; while implementations of hash tables may involve both arrays and linked lists.
|
||||
- **Array-based implementations**: Stacks, Queues, Hash Tables, Trees, Heaps, Graphs, Matrices, Tensors (arrays with dimensions $\geq 3$).
|
||||
|
@ -3,9 +3,9 @@ comments: true
|
||||
icon: material/shape-outline
|
||||
---
|
||||
|
||||
# Chapter 3. Data Structures
|
||||
# Chapter 3. Data structures
|
||||
|
||||
{ class="cover-image" }
|
||||
{ class="cover-image" }
|
||||
|
||||
!!! abstract
|
||||
|
||||
@ -15,8 +15,8 @@ icon: material/shape-outline
|
||||
|
||||
## Chapter Contents
|
||||
|
||||
- [3.1 Classification of Data Structures](https://www.hello-algo.com/en/chapter_data_structure/classification_of_data_structure/)
|
||||
- [3.2 Fundamental Data Types](https://www.hello-algo.com/en/chapter_data_structure/basic_data_types/)
|
||||
- [3.3 Number Encoding *](https://www.hello-algo.com/en/chapter_data_structure/number_encoding/)
|
||||
- [3.4 Character Encoding *](https://www.hello-algo.com/en/chapter_data_structure/character_encoding/)
|
||||
- [3.1 Classification of data structures](https://www.hello-algo.com/en/chapter_data_structure/classification_of_data_structure/)
|
||||
- [3.2 Fundamental data types](https://www.hello-algo.com/en/chapter_data_structure/basic_data_types/)
|
||||
- [3.3 Number encoding *](https://www.hello-algo.com/en/chapter_data_structure/number_encoding/)
|
||||
- [3.4 Character encoding *](https://www.hello-algo.com/en/chapter_data_structure/character_encoding/)
|
||||
- [3.5 Summary](https://www.hello-algo.com/en/chapter_data_structure/summary/)
|
||||
|
@ -2,13 +2,13 @@
|
||||
comments: true
|
||||
---
|
||||
|
||||
# 3.3 Number Encoding *
|
||||
# 3.3 Number encoding *
|
||||
|
||||
!!! note
|
||||
|
||||
In this book, chapters marked with an asterisk '*' are optional readings. If you are short on time or find them challenging, you may skip these initially and return to them after completing the essential chapters.
|
||||
|
||||
## 3.3.1 Integer Encoding
|
||||
## 3.3.1 Integer encoding
|
||||
|
||||
In the table from the previous section, we observed that all integer types can represent one more negative number than positive numbers, such as the `byte` range of $[-128, 127]$. This phenomenon seems counterintuitive, and its underlying reason involves knowledge of sign-magnitude, one's complement, and two's complement encoding.
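The asymmetric range can be observed by decoding all 256 possible 8-bit patterns as two's complement values. This is a minimal sketch; `decode_twos_complement_byte` is a hypothetical helper, not part of any library.

```python
def decode_twos_complement_byte(bits: int) -> int:
    """Interpret an 8-bit pattern (0..255) as a two's complement integer."""
    if not 0 <= bits <= 0xFF:
        raise ValueError("expected an 8-bit pattern")
    # When the highest bit is set, the pattern represents bits - 2^8
    return bits - 256 if bits & 0x80 else bits


values = [decode_twos_complement_byte(b) for b in range(256)]
print(min(values), max(values))      # -128 127
print(sum(1 for v in values if v < 0))  # 128 negative numbers
print(sum(1 for v in values if v > 0))  # 127 positive numbers
```

The pattern `1000 0000` has no positive counterpart, which accounts for the one extra negative number.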
|
||||
|
||||
@ -20,9 +20,9 @@ Firstly, it's important to note that **numbers are stored in computers using the
|
||||
|
||||
The following diagram illustrates the conversions among sign-magnitude, one's complement, and two's complement:
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-4 Conversions between Sign-Magnitude, One's Complement, and Two's Complement </p>
|
||||
<p align="center"> Figure 3-4 Conversions between sign-magnitude, one's complement, and two's complement </p>
|
||||
|
||||
Although sign-magnitude is the most intuitive, it has limitations. For one, **negative numbers in sign-magnitude cannot be directly used in calculations**. For example, in sign-magnitude, calculating $1 + (-2)$ results in $-3$, which is incorrect.
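The incorrect result can be reproduced by adding the raw 8-bit sign-magnitude patterns directly. The helpers `to_sign_magnitude` and `from_sign_magnitude` are hypothetical and assume an 8-bit width.

```python
def to_sign_magnitude(value: int) -> int:
    """Encode an integer in [-127, 127] as an 8-bit sign-magnitude pattern."""
    if not -127 <= value <= 127:
        raise ValueError("value does not fit in 8-bit sign-magnitude")
    sign = 1 if value < 0 else 0
    return (sign << 7) | abs(value)  # highest bit is the sign, the rest is the magnitude


def from_sign_magnitude(pattern: int) -> int:
    """Decode an 8-bit sign-magnitude pattern back to an integer."""
    magnitude = pattern & 0x7F
    return -magnitude if pattern >> 7 else magnitude


a = to_sign_magnitude(1)           # 0000 0001
b = to_sign_magnitude(-2)          # 1000 0010
raw_sum = (a + b) & 0xFF           # 1000 0011, plain binary addition
print(format(raw_sum, "08b"))      # 10000011
print(from_sign_magnitude(raw_sum))  # -3, although 1 + (-2) should be -1
```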
|
||||
|
||||
@ -92,7 +92,7 @@ We can now summarize the reason for using two's complement in computers: with tw
|
||||
|
||||
The design of two's complement is quite ingenious, and due to space constraints, we'll stop here. Interested readers are encouraged to explore further.
|
||||
|
||||
## 3.3.2 Floating-Point Number Encoding
|
||||
## 3.3.2 Floating-point number encoding
|
||||
|
||||
You might have noticed something intriguing: despite having the same length of 4 bytes, why does a `float` have a much larger range of values compared to an `int`? This seems counterintuitive, as one would expect the range to shrink for `float` since it needs to represent fractions.
|
||||
|
||||
@ -129,9 +129,9 @@ $$
|
||||
\end{aligned}
|
||||
$$
|
||||
|
||||
{ class="animation-figure" }
|
||||
{ class="animation-figure" }
|
||||
|
||||
<p align="center"> Figure 3-5 Example Calculation of a float in IEEE 754 Standard </p>
|
||||
<p align="center"> Figure 3-5 Example calculation of a float in IEEE 754 standard </p>
|
||||
|
||||
Observing the diagram, given the example data $\mathrm{S} = 0$, $\mathrm{E} = 124$, $\mathrm{N} = 2^{-2} + 2^{-3} = 0.375$, we have:
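The same example can be cross-checked in code. The sketch below assembles the 32 bits from the three fields with Python's `struct` module and compares the result with the standard IEEE 754 single-precision formula $(-1)^{\mathrm{S}} \times 2^{\mathrm{E} - 127} \times (1 + \mathrm{N})$. The helper name `float32_from_fields` is an assumption made for this illustration.

```python
import struct


def float32_from_fields(sign: int, exponent: int, fraction_bits: int) -> float:
    """Build a 32-bit float from its sign (1 bit), exponent (8 bits), and fraction (23 bits)."""
    bits = (sign << 31) | (exponent << 23) | fraction_bits
    # Pack the integer as 4 big-endian bytes, then reinterpret them as a float
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# In the 23-bit fraction, bit 22 is worth 2^-1, bit 21 is 2^-2, bit 20 is 2^-3
fraction_bits = (1 << 21) | (1 << 20)  # N = 2^-2 + 2^-3 = 0.375
value = float32_from_fields(0, 124, fraction_bits)
manual = (-1) ** 0 * 2.0 ** (124 - 127) * (1 + 0.375)

print(value)   # 0.171875
print(manual)  # 0.171875
```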
|
||||
|
||||
@ -145,7 +145,7 @@ Now we can answer the initial question: **The representation of `float` includes
|
||||
|
||||
As shown in Table 3-2, exponent bits $E = 0$ and $E = 255$ have special meanings, **used to represent zero, infinity, $\mathrm{NaN}$, etc.**
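These special exponent values can be inspected by reinterpreting hand-built bit patterns as 32-bit floats. This is only a sketch using Python's `struct` and `math` modules; `float32_from_bits` is a hypothetical helper.

```python
import math
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


zero = float32_from_bits(0x00000000)                   # E = 0,   fraction = 0
infinity = float32_from_bits(0xFF << 23)               # E = 255, fraction = 0
not_a_number = float32_from_bits((0xFF << 23) | (1 << 22))  # E = 255, fraction != 0

print(zero)          # 0.0
print(infinity)      # inf
print(not_a_number)  # nan
```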
|
||||
|
||||
<p align="center"> Table 3-2 Meaning of Exponent Bits </p>
|
||||
<p align="center"> Table 3-2 Meaning of exponent bits </p>
|
||||
|
||||
<div class="center-table" markdown>
|
||||
|
||||
|
@ -4,7 +4,7 @@ comments: true
|
||||
|
||||
# 3.5 Summary
|
||||
|
||||
### 1. Key Review
|
||||
### 1. Key review
|
||||
|
||||
- Data structures can be categorized from two perspectives: logical structure and physical structure. Logical structure describes the logical relationships between data elements, while physical structure describes how data is stored in computer memory.
|
||||
- Common logical structures include linear, tree-like, and network structures. We generally classify data structures into linear (arrays, linked lists, stacks, queues) and non-linear (trees, graphs, heaps) based on their logical structure. The implementation of hash tables may involve both linear and non-linear data structures.
|
||||
|