Building web apps from scratch - Request Parsing - Part 2
NOTE: This post is still a work in progress!!
Our little web server is very cool, isn't it? What's missing now, from a feature standpoint, is sending a response based on what the client requested. But first, we need to improve its robustness a bit. There are some error corner cases that we haven't handled yet.
Handling partial sends
The first issue concerns recv and send. We assumed that when we pass a buffer to send, the system will either send the entire message or fail. This isn't correct: send can also do partial sends. Let's say we want to send 100 bytes. The first call may send only 10 of them, so we need to call send again with the remaining 90:
    int send_all(SOCKET_TYPE sock, void *src, size_t num)
    {
        size_t sent = 0;
        while (sent < num) {
            // send may accept fewer bytes than requested, so retry with the rest
            int just_sent = send(sock, (char*) src + sent, num - sent, 0);
            if (just_sent < 0) return -1;
            sent += (size_t) just_sent;
        }
        return sent;
    }
This function calls send multiple times until every byte has been sent. If an error occurs before all bytes are sent, it returns early with -1. By using this function in place of send, we make our server much more reliable!
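To see the retry loop at work without a real socket, here is a minimal sketch of the 100-byte scenario. The names fake_send, fake_send_all, wire, wire_len and num_calls are hypothetical and only exist for this illustration: fake_send stands in for send and accepts at most 10 bytes per call.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for send(): it accepts at most 10 bytes per
// call and appends them to an in-memory "wire" buffer.
char   wire[256];
size_t wire_len  = 0;
int    num_calls = 0;

int fake_send(const char *src, size_t num)
{
    size_t chunk = num < 10 ? num : 10;
    if (wire_len + chunk > sizeof(wire))
        return -1; // Pretend the connection failed
    memcpy(wire + wire_len, src, chunk);
    wire_len += chunk;
    num_calls++;
    return (int) chunk;
}

// Same loop as send_all, with fake_send in place of send
int fake_send_all(const void *src, size_t num)
{
    size_t sent = 0;
    while (sent < num) {
        int just_sent = fake_send((const char*) src + sent, num - sent);
        if (just_sent < 0) return -1;
        sent += (size_t) just_sent;
    }
    return (int) sent;
}
```

Passing a 100-byte message to fake_send_all results in ten calls to fake_send, and all 100 bytes end up in wire in the original order.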
The recv function behaves in a similar way. Instead of worrying about sending only part of the bytes, we need to worry about receiving fewer bytes than the client sent, or fewer than our buffer can hold. But since we are ignoring the incoming message for now, it does not make a difference.
The syntax of an HTTP request
Now we are ready to parse the client's request. This will allow us to return different responses based on what the client sent us. The proper way to go about this would be to read the HTTP specification and implement the parser accordingly, but starting with an approximation is good enough for now. Here's an example of an HTTP request:
GET / HTTP/1.1
Host: 127.0.0.1:8080
Connection: keep-alive
sec-ch-ua: "Not(A:Brand";v="99", "Brave";v="133", "Chromium";v="133"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Sec-GPC: 1
Accept-Language: en-US,en;q=0.6
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br, zstd
I obtained this by simply adding a print statement in our server after recv (don't forget the null terminator!).
The first line contains three things:
- method
- resource path
- HTTP version
The method changes how the request is handled. The most common methods are:
- GET: The client is asking us for a resource
- POST: The client is sending us some data
This is a bit simplistic, but we can refine it later.
You can think of a resource path as the file name the client is requesting. The resource may be something other than a file, though. For instance, we could have a resource called /users which returns a list of users fetched from the database.
The version we'll support for now is HTTP/1.0, but this does not matter much as versions 1.0 and 1.1 use the same syntax.
The first line terminates with a new line, which is made of a carriage return character followed by a newline character. In C, these characters are written as \r\n. After that, we have a list of headers, strings in the form:
name: value\r\n
We can mostly ignore these and only add support for specific ones as we need them. The one we are interested in is the Content-Length header, which tells us the length in bytes of the request payload. GET requests usually don't have a payload, so this header is not used. We will need it for POST requests.
After the entire header list, we find one more \r\n sequence, in addition to the \r\n that closes the last header. If the request has a payload, this is where it starts.
Just to give you an example, this is what a POST request looks like:
POST /some-resource HTTP/1.0\r\n
host: 127.0.0.1\r\n
Content-Length: 15\r\n
\r\n
I'm the payload
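As a small illustration of this layout, the sketch below stores the example request in a C string and locates the payload by searching for the blank line. The names example_request and find_payload_offset are hypothetical helpers made up for this example, not part of the server.

```c
#include <string.h>

// The POST example from above, written as a C string literal
const char example_request[] =
    "POST /some-resource HTTP/1.0\r\n"
    "host: 127.0.0.1\r\n"
    "Content-Length: 15\r\n"
    "\r\n"
    "I'm the payload";

// Returns the offset of the first payload byte (just past the first
// "\r\n\r\n"), or -1 if the blank line is not in the buffer.
int find_payload_offset(const char *src, int size)
{
    for (int i = 0; i + 3 < size; i++)
        if (memcmp(src + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    return -1;
}
```

For the example request, the number of bytes after the returned offset is 15, which matches the Content-Length header.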
Parsing a request
Now we start writing the parser. There are many ways one can go about parsing; this is the way I found easiest for HTTP. The basic idea is that we have a pointer to the request bytes, a length, and a cursor. We read the bytes by advancing the cursor and extract information from them by adding it to a structure. When the cursor reaches the end, we have a structure with all the information from the request bytes, easily accessible for our server. The request structure looks like this:
    // A string that is not null-terminated: a pointer plus a length
    typedef struct { char *data; int size; } string;

    #define MAX_HEADERS 32

    typedef enum {
        HTTP_METHOD_GET,
        HTTP_METHOD_POST,
    } HTTPMethod;

    typedef struct {
        string name;
        string value;
    } HTTPHeader;

    typedef struct {
        int major;
        int minor;
    } HTTPVersion;

    typedef struct {
        HTTPMethod  method;
        string      resource_path;
        HTTPVersion version;
        HTTPHeader  headers[MAX_HEADERS];
        int         num_headers;
    } HTTPRequest;
Notice how we created a helper structure for strings. This will come in handy for the entire project! The parsing function will use this interface:
    bool parse_request(string src, HTTPRequest *dst)
    {
        int cur = 0;

        // .. parsing code goes here ..
    }
If we parsed the request successfully, true is returned. If some error occurred, false is returned. When we succeed, the dst argument will be initialized.
First, we parse the method. If the request doesn't start with GET or POST, we consider it an error:
    if (3 < src.size
        && src.data[0] == 'G'
        && src.data[1] == 'E'
        && src.data[2] == 'T'
        && src.data[3] == ' ') {
        dst->method = HTTP_METHOD_GET;
        cur = 4;
    } else if (4 < src.size
        && src.data[0] == 'P'
        && src.data[1] == 'O'
        && src.data[2] == 'S'
        && src.data[3] == 'T'
        && src.data[4] == ' ') {
        dst->method = HTTP_METHOD_POST;
        cur = 5;
    } else {
        // Invalid method
        return false;
    }
After this block, the cursor will point to the character that comes after the first space, which is the first character of the request path. The path goes from just after the first space to just before the second space.
    // Check that there is at least one non-space character where the cursor points
    if (cur == src.size || src.data[cur] == ' ')
        return false; // No path

    // Save the offset of the path in the string
    int path_offset = cur;

    // The first character is not a space. Now loop until we find one
    do
        cur++;
    while (cur < src.size && src.data[cur] != ' ');

    // There are two ways we exit the loop:
    //   1) The cursor reached the end of the string because no space
    //      was found (cur == src.size)
    //   2) We found a space (src.data[cur] == ' ')
    // Of course (1) is an error

    if (cur == src.size)
        return false;

    int path_length = cur - path_offset;

    // Consume the space that comes after the path
    cur++;

    dst->resource_path = (string) { .data = src.data + path_offset, path_length };
Instead of creating a copy of the resource path to store in the HTTPRequest structure, we created a string that points inside the input buffer. With this trick we avoided a dynamic allocation. The downside is that the contents of HTTPRequest now depend on the input buffer staying around.
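One consequence of pointing inside the input buffer is that these strings are not null-terminated, so functions like strcmp or a plain "%s" can't be used on them directly. Here is a minimal sketch of two helpers that respect the stored length. The names string_eq_cstr and string_format are hypothetical, and the string typedef is repeated so the snippet compiles on its own.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// Compare a slice with a null-terminated C string
bool string_eq_cstr(string s, const char *cstr)
{
    size_t len = strlen(cstr);
    return (size_t) s.size == len && memcmp(s.data, cstr, len) == 0;
}

// Write the slice between brackets into dst.
// "%.*s" prints at most s.size characters, so no terminator is needed.
int string_format(string s, char *dst, size_t max)
{
    return snprintf(dst, max, "[%.*s]", s.size, s.data);
}
```

For example, with a buffer holding "GET /users HTTP/1.0\r\n\r\n", the slice { buffer + 4, 6 } compares equal to "/users" and formats as "[/users]", even though the byte that follows it in the buffer is a space and not a terminator.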
Now we parse the version. We expect one of the following strings: HTTP/1, HTTP/1.0, HTTP/1.1, followed by \r\n:
    // If we don't find the string "HTTP/", that's an error
    if (4 >= src.size - cur
        || src.data[cur+0] != 'H'
        || src.data[cur+1] != 'T'
        || src.data[cur+2] != 'T'
        || src.data[cur+3] != 'P'
        || src.data[cur+4] != '/')
        return false;
    cur += 5;

    // Now we expect either "1\r\n", "1.0\r\n", or "1.1\r\n"
    if (4 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '.'
        && src.data[cur+2] == '1'
        && src.data[cur+3] == '\r'
        && src.data[cur+4] == '\n') {
        cur += 5;
        dst->version = (HTTPVersion) {1, 1};
    } else if (4 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '.'
        && src.data[cur+2] == '0'
        && src.data[cur+3] == '\r'
        && src.data[cur+4] == '\n') {
        cur += 5;
        dst->version = (HTTPVersion) {1, 0};
    } else if (2 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '\r'
        && src.data[cur+2] == '\n') {
        cur += 3;
        // A bare "HTTP/1" is treated as version 1.0
        dst->version = (HTTPVersion) {1, 0};
    } else {
        // Invalid version
        return false;
    }
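The character-by-character comparisons used for the method and the version all follow the same pattern: check that enough bytes remain, compare them, advance the cursor. As an optional alternative, that pattern could be factored into a helper. This is only a sketch; consume_literal is a hypothetical name and the post's parser does not use it.

```c
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// If the bytes at *cur match lit, advance *cur past them and return true.
// Otherwise leave *cur untouched and return false.
bool consume_literal(string src, int *cur, const char *lit)
{
    int len = (int) strlen(lit);
    if (len > src.size - *cur)
        return false; // Not enough bytes left
    if (memcmp(src.data + *cur, lit, (size_t) len) != 0)
        return false; // Bytes differ
    *cur += len;
    return true;
}
```

With such a helper, the version check could be written as a chain like consume_literal(src, &cur, "1.1\r\n"), then "1.0\r\n", then "1\r\n", setting dst->version in each branch. The explicit comparisons and the helper behave the same; which one reads better is a matter of taste.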
And that was the first line. Now comes the easy part! The cursor now points to the first character of the list of headers. We must consume headers until we find the final \r\n, which denotes the end of the request head.
    // Initialize the header array
    dst->num_headers = 0;

    // Loop until we find the final \r\n
    while (1 >= src.size - cur
        || src.data[cur+0] != '\r'
        || src.data[cur+1] != '\n') {

        // The cursor now points to the first character of the header's name
        int name_offset = cur;

        // Consume characters until we get to the separator
        while (cur < src.size && src.data[cur] != ':')
            cur++;
        if (cur == src.size)
            return false; // Cursor reached the end of the string
        string header_name = { src.data + name_offset, cur - name_offset };
        cur++; // Consume the ':'

        // Now the cursor points to the first character of the header value
        // (any space that follows the ':' is kept as part of the value)
        int value_offset = cur;
        while (cur < src.size && src.data[cur] != '\r')
            cur++;
        if (cur == src.size)
            return false; // Didn't find a '\r'
        string header_value = { src.data + value_offset, cur - value_offset };

        // Now we expect \r\n to terminate the header
        if (1 >= src.size - cur
            || src.data[cur+0] != '\r'
            || src.data[cur+1] != '\n')
            return false; // Didn't find \r\n
        cur += 2;

        if (dst->num_headers == MAX_HEADERS)
            return false; // We reached the end of the static array
        dst->headers[dst->num_headers++] = (HTTPHeader) { header_name, header_value };
    }

    // We exited the loop, so we know there is a final \r\n we must skip
    cur += 2;

    // Finished
    return true;
And there it is! The request parser! It may seem quite daunting, but it's the same trick repeated over and over. Here is the server with the added parser:
    #include <stdio.h>   // printf
    #include <stdbool.h> // bool, true, false

    #ifdef _WIN32
    #include <winsock2.h>
    #define SOCKET_TYPE SOCKET
    #define INVALID_SOCKET_VALUE INVALID_SOCKET
    #define CLOSE_SOCKET closesocket
    #else
    #include <unistd.h>    // close
    #include <arpa/inet.h> // socket, htons, inet_addr, sockaddr_in, bind, listen, accept, recv, send
    #define SOCKET_TYPE int
    #define INVALID_SOCKET_VALUE -1
    #define CLOSE_SOCKET close
    #endif

    typedef struct { char *data; int size; } string;

    #define MAX_HEADERS 32

    typedef enum {
        HTTP_METHOD_GET,
        HTTP_METHOD_POST,
    } HTTPMethod;

    typedef struct {
        string name;
        string value;
    } HTTPHeader;

    typedef struct {
        int major;
        int minor;
    } HTTPVersion;

    typedef struct {
        HTTPMethod  method;
        string      resource_path;
        HTTPVersion version;
        HTTPHeader  headers[MAX_HEADERS];
        int         num_headers;
    } HTTPRequest;

    bool parse_request(string src, HTTPRequest *dst)
    {
        int cur = 0;

        if (3 < src.size
            && src.data[0] == 'G'
            && src.data[1] == 'E'
            && src.data[2] == 'T'
            && src.data[3] == ' ') {
            dst->method = HTTP_METHOD_GET;
            cur = 4;
        } else if (4 < src.size
            && src.data[0] == 'P'
            && src.data[1] == 'O'
            && src.data[2] == 'S'
            && src.data[3] == 'T'
            && src.data[4] == ' ') {
            dst->method = HTTP_METHOD_POST;
            cur = 5;
        } else {
            // Invalid method
            return false;
        }

        // Check that there is at least one non-space character where the cursor points
        if (cur == src.size || src.data[cur] == ' ')
            return false; // No path

        // Save the offset of the path in the string
        int path_offset = cur;

        // The first character is not a space. Now loop until we find one
        do
            cur++;
        while (cur < src.size && src.data[cur] != ' ');

        // There are two ways we exit the loop:
        //   1) The cursor reached the end of the string because no space
        //      was found (cur == src.size)
        //   2) We found a space (src.data[cur] == ' ')
        // Of course (1) is an error

        if (cur == src.size)
            return false;

        int path_length = cur - path_offset;

        // Consume the space that comes after the path
        cur++;

        dst->resource_path = (string) { .data = src.data + path_offset, path_length };

        // If we don't find the string "HTTP/", that's an error
        if (4 >= src.size - cur
            || src.data[cur+0] != 'H'
            || src.data[cur+1] != 'T'
            || src.data[cur+2] != 'T'
            || src.data[cur+3] != 'P'
            || src.data[cur+4] != '/')
            return false;
        cur += 5;

        // Now we expect either "1\r\n", "1.0\r\n", or "1.1\r\n"
        if (4 < src.size - cur
            && src.data[cur+0] == '1'
            && src.data[cur+1] == '.'
            && src.data[cur+2] == '1'
            && src.data[cur+3] == '\r'
            && src.data[cur+4] == '\n') {
            cur += 5;
            dst->version = (HTTPVersion) {1, 1};
        } else if (4 < src.size - cur
            && src.data[cur+0] == '1'
            && src.data[cur+1] == '.'
            && src.data[cur+2] == '0'
            && src.data[cur+3] == '\r'
            && src.data[cur+4] == '\n') {
            cur += 5;
            dst->version = (HTTPVersion) {1, 0};
        } else if (2 < src.size - cur
            && src.data[cur+0] == '1'
            && src.data[cur+1] == '\r'
            && src.data[cur+2] == '\n') {
            cur += 3;
            dst->version = (HTTPVersion) {1, 0};
        } else {
            // Invalid version
            return false;
        }

        // Initialize the header array
        dst->num_headers = 0;

        // Loop until we find the final \r\n
        while (1 >= src.size - cur
            || src.data[cur+0] != '\r'
            || src.data[cur+1] != '\n') {

            // The cursor now points to the first character of the header's name
            int name_offset = cur;

            // Consume characters until we get to the separator
            while (cur < src.size && src.data[cur] != ':')
                cur++;
            if (cur == src.size)
                return false; // Cursor reached the end of the string
            string header_name = { src.data + name_offset, cur - name_offset };
            cur++; // Consume the ':'

            // Now the cursor points to the first character of the header value
            int value_offset = cur;
            while (cur < src.size && src.data[cur] != '\r')
                cur++;
            if (cur == src.size)
                return false; // Didn't find a '\r'
            string header_value = { src.data + value_offset, cur - value_offset };

            // Now we expect \r\n to terminate the header
            if (1 >= src.size - cur
                || src.data[cur+0] != '\r'
                || src.data[cur+1] != '\n')
                return false; // Didn't find \r\n
            cur += 2;

            if (dst->num_headers == MAX_HEADERS)
                return false; // We reached the end of the static array
            dst->headers[dst->num_headers++] = (HTTPHeader) { header_name, header_value };
        }

        // We exited the loop, so we know there is a final \r\n we must skip
        cur += 2;

        return true;
    }

    int send_all(SOCKET_TYPE sock, void *src, size_t num)
    {
        size_t sent = 0;
        while (sent < num) {
            int just_sent = send(sock, (char*) src + sent, num - sent, 0);
            if (just_sent < 0) return -1;
            sent += (size_t) just_sent;
        }
        return sent;
    }

    int main()
    {
    #ifdef _WIN32
        WSADATA wsaData;
        int err = WSAStartup(MAKEWORD(2, 2), &wsaData);
        if (err != 0) {
            printf("WSAStartup failed\n");
            return -1;
        }
    #endif

        // Create the listening socket
        SOCKET_TYPE listen_socket = socket(AF_INET, SOCK_STREAM, 0);
        if (listen_socket == INVALID_SOCKET_VALUE) {
            printf("socket failed\n");
            return -1;
        }

        struct sockaddr_in bind_buffer;
        bind_buffer.sin_family = AF_INET;
        bind_buffer.sin_port   = htons(8080);
        bind_buffer.sin_addr.s_addr = inet_addr("127.0.0.1");

        if (bind(listen_socket, (struct sockaddr*) &bind_buffer, sizeof(bind_buffer))) {
            printf("bind failed\n");
            return -1;
        }

        if (listen(listen_socket, 32)) {
            printf("listen failed\n");
            return -1;
        }

        while (1) {
            SOCKET_TYPE client_socket = accept(listen_socket, NULL, NULL);
            if (client_socket == INVALID_SOCKET_VALUE) {
                printf("accept failed\n");
                continue;
            }

            char request_buffer[4096];
            int len = recv(client_socket, request_buffer, sizeof(request_buffer), 0);
            if (len < 0) {
                printf("recv failed\n");
                CLOSE_SOCKET(client_socket);
                continue;
            }

            HTTPRequest parsed_request;
            if (!parse_request((string) {request_buffer, len}, &parsed_request)) {
                // Parsing failed
                char response_buffer[] =
                    "HTTP/1.0 400 Bad Request\r\n"
                    "Content-Length: 0\r\n"
                    "\r\n";
                // sizeof counts the string's null terminator, which must not be sent
                send_all(client_socket, response_buffer, sizeof(response_buffer) - 1);

            } else {
                // Parsing succeeded
                char response_buffer[] =
                    "HTTP/1.0 200 OK\r\n"
                    "Content-Length: 13\r\n"
                    "Content-Type: text/plain\r\n"
                    "\r\n"
                    "Hello, world!";
                // sizeof counts the string's null terminator, which must not be sent
                send_all(client_socket, response_buffer, sizeof(response_buffer) - 1);
            }
            CLOSE_SOCKET(client_socket);
        }
        // This point will never be reached
    }
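Once the headers are parsed, reading Content-Length comes down to finding the header by name and converting its value to a number. Below is a minimal sketch of one way to do that. The helpers header_name_is and find_content_length are hypothetical names, the typedefs are repeated so the snippet compiles on its own, and the sketch assumes values look the way our parser stores them, that is, with the space that follows the ':' still attached.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;
typedef struct { string name; string value; } HTTPHeader;

static char to_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
}

// Header names are case-insensitive, so compare ignoring ASCII case
bool header_name_is(string name, const char *expected)
{
    int len = (int) strlen(expected);
    if (name.size != len) return false;
    for (int i = 0; i < len; i++)
        if (to_lower(name.data[i]) != to_lower(expected[i]))
            return false;
    return true;
}

// Returns the Content-Length value, or -1 if it is missing or malformed
int find_content_length(const HTTPHeader *headers, int num_headers)
{
    for (int i = 0; i < num_headers; i++) {
        if (!header_name_is(headers[i].name, "Content-Length"))
            continue;
        string v = headers[i].value;
        int j = 0;
        // Skip the blanks that follow the ':'
        while (j < v.size && (v.data[j] == ' ' || v.data[j] == '\t'))
            j++;
        if (j == v.size) return -1; // No digits at all
        int result = 0;
        for (; j < v.size; j++) {
            if (v.data[j] < '0' || v.data[j] > '9') return -1;
            int digit = v.data[j] - '0';
            if (result > (INT_MAX - digit) / 10) return -1; // Would overflow
            result = result * 10 + digit;
        }
        return result;
    }
    return -1;
}
```

In the server this could be called as find_content_length(parsed_request.headers, parsed_request.num_headers). Note that this sketch treats trailing whitespace in the value as malformed, which is stricter than necessary.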
Handling partial reads
As we did with send, we need to make sure our call to recv reads the entire request head. It is possible that we read only part of the request head, in which case we need to call recv again. We can stop when we find \r\n\r\n, which marks the end of the head and the start of the body. If we fill up the buffer before finding that token, we consider it an error.
    int recv_request_head(SOCKET_TYPE sock, char *dst, int max, int *head_len)
    {
        int received = 0;
        while (1) {
            int just_received = recv(sock, dst + received, max - received, 0);
            // A negative value is an error. Zero means the client closed the
            // connection before sending the whole head, so we must stop too,
            // or we would loop forever.
            if (just_received <= 0) return -1;
            received += just_received;

            // Look for \r\n\r\n
            int i = 0;
            while (3 < received - i
                && (dst[i+0] != '\r'
                ||  dst[i+1] != '\n'
                ||  dst[i+2] != '\r'
                ||  dst[i+3] != '\n'))
                i++;
            if (3 < received - i) {
                // We found the \r\n\r\n and it is at position "i"
                // We consider the head to go from the first byte to the last \n
                *head_len = i + 4;
                break;
            }
            // We did not find the end of the head. If the buffer is now full that's an error
            if (received == max)
                return -1;
        }
        // If we are here we received the head. Note that we may have also read
        // some bytes after the \r\n\r\n, which are part of the request body.
        return received;
    }
With this, our main loop changes:
    // ... parsing and everything else stays the same ...

    int main()
    {
        // ... this code stays the same too ...

        while (1) {
            SOCKET_TYPE client_socket = accept(listen_socket, NULL, NULL);
            if (client_socket == INVALID_SOCKET_VALUE) {
                printf("accept failed\n");
                continue;
            }

            char request_buffer[4096];
            int received_total, head_len;
            received_total = recv_request_head(client_socket, request_buffer, sizeof(request_buffer), &head_len);
            if (received_total < 0) {
                printf("recv_request_head failed\n");
                CLOSE_SOCKET(client_socket);
                continue;
            }
            string request_head = {request_buffer, head_len};

            HTTPRequest parsed_request;
            if (!parse_request(request_head, &parsed_request)) {
                // ... unchanged ...
            } else {
                // ... unchanged ...
            }
            CLOSE_SOCKET(client_socket);
        }
        // This point will never be reached
    }
Further improvements
This parser will work well most of the time, but there are a couple of corner cases we haven't handled:
- We are blindly accepting characters while parsing the headers and path. These strings have specific syntaxes and allowed characters.
- There is something called a "folded header", which is a header spanning multiple lines. We don't need to handle them, but we should at least recognize and reject them.
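As a starting point for both items, here is a minimal sketch of the checks the header loop could perform. The function names are hypothetical. The set of characters allowed in a header name follows the "token" rule of the HTTP specification (RFC 9110), and a folded header is recognized as a line in the header list that begins with a space or a tab.

```c
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// True if c may appear in a header name ("tchar" in RFC 9110)
bool is_token_char(char c)
{
    if (c >= 'a' && c <= 'z') return true;
    if (c >= 'A' && c <= 'Z') return true;
    if (c >= '0' && c <= '9') return true;
    // strchr also matches the terminator, so rule out '\0' first
    return c != '\0' && strchr("!#$%&'*+-.^_`|~", c) != NULL;
}

// A header name must be non-empty and made only of token characters
bool is_valid_header_name(string name)
{
    if (name.size <= 0) return false;
    for (int i = 0; i < name.size; i++)
        if (!is_token_char(name.data[i]))
            return false;
    return true;
}

// When cur is at the start of a line in the header list, a leading
// space or tab means the line continues the previous header's value.
bool is_folded_line(string src, int cur)
{
    return cur < src.size && (src.data[cur] == ' ' || src.data[cur] == '\t');
}
```

In parse_request, is_folded_line could be checked at the top of each iteration of the header loop and is_valid_header_name right after header_name is built, returning false in both failure cases. Since a space is not a token character, the name check alone would already reject a folded line, but the explicit check makes the reason for the rejection clear.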