Cozis

Building web apps from scratch - Request Parsing - Part 2

NOTE: This post is still a work in progress!! Our little web server is very cool, isn't it? The missing thing now, from a feature standpoint, is sending a response based on what the client requested. But first, we need to improve its robustness a bit. There are some error cases that we haven't handled yet.

Handling partial sends

The first issue concerns recv and send. We assumed that when we pass a buffer to send, the system either sends the entire message or fails. This isn't correct: send can also perform partial sends. Let's say we want to send 100 bytes. The first call may send only 10 of them, so we need to call send again with the remaining 90:

int send_all(SOCKET_TYPE sock, void *src, size_t num)
{
    size_t sent = 0;
    while (sent < num) {
        // send may accept fewer bytes than we offer, so we only pass what is left
        int just_sent = send(sock, (char*) src + sent, num - sent, 0);
        if (just_sent < 0) return -1;
        sent += (size_t) just_sent;
    }
    return sent;
}

This function calls send multiple times until every byte has been sent. If an error occurs before sending all bytes, it returns early with -1. By using this function in place of send we made our server much more reliable! The recv function behaves in a similar way. Instead of worrying about sending only part of the bytes, we need to worry about receiving fewer bytes than we asked for. But since we are ignoring the incoming message for now, it does not make a difference.
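For completeness, here is a rough sketch of what the receiving counterpart could look like when we know in advance how many bytes to expect. The name recv_exact is made up for this example, it is not used by the server in this post, and this version is POSIX-only (it uses ssize_t and a plain int socket):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>  // ssize_t
#include <sys/socket.h> // recv

// Hypothetical helper, POSIX-only sketch.
// Keeps calling recv until exactly num bytes have been stored in dst.
// Returns 0 on success, -1 on error or if the peer closes the connection early.
int recv_exact(int sock, void *dst, size_t num)
{
    assert(dst != NULL);

    size_t received = 0;
    while (received < num) {
        ssize_t just_received = recv(sock, (char*) dst + received, num - received, 0);
        if (just_received < 0)
            return -1; // recv failed
        if (just_received == 0)
            return -1; // The peer closed the connection before sending everything
        received += (size_t) just_received;
    }
    return 0;
}
```

Note the extra check for a return value of 0: unlike send, recv uses it to tell us the other side closed the connection, so looping again would never make progress.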

The syntax of an HTTP request

Now we are ready to parse the client's request. This will allow us to return different responses based on what the client sent us. The proper way to go about this would be reading the HTTP specification and implementing the parser accordingly, but starting with an approximation is good enough for now. Here's an example of an HTTP request:

GET / HTTP/1.1
Host: 127.0.0.1:8080
Connection: keep-alive
sec-ch-ua: "Not(A:Brand";v="99", "Brave";v="133", "Chromium";v="133"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Sec-GPC: 1
Accept-Language: en-US,en;q=0.6
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br, zstd

I obtained this by simply adding a print statement in our server after recv (don't forget the null terminator!). The first line contains three things:

  1. method
  2. resource path
  3. HTTP version

The method changes how the request is handled. The most common methods are

  1. GET: The client is asking us for a resource
  2. POST: The client is sending us some data

This is a bit simplistic, but we can refine it later. You can think of a resource path as the name of the file the client is requesting, though the resource may be something other than a file. For instance, we could have a resource called /users which returns a list of users fetched from the database. The version we'll support for now is HTTP/1.0, but this does not matter much, as versions 1.0 and 1.1 use the same syntax. The first line ends with a line break made of a carriage return followed by a newline character. In C, these characters are written as \r\n. After that, we have a list of headers, which are strings of the form:

name: value\r\n

We can mostly ignore these and only add support for specific ones as we need them. The one we are interested in is the Content-Length header, which tells us the length in bytes of the request payload. GET requests don't have a payload, so this header is not used. We will need it for POST requests. After the entire header list, we find one more \r\n sequence, in addition to the \r\n that closes the last header. If the request has a payload, this is where it starts. To give you an example, this is what a POST request looks like:

POST /some-resource HTTP/1.0\r\n
host: 127.0.0.1\r\n
Content-Length: 15\r\n
\r\n
I'm the payload
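To make the role of Content-Length a bit more concrete, here is a minimal sketch of how the text of its value (for example " 15", with the space that follows the colon) could be turned into a number. The function name parse_content_length and its error convention are assumptions made for this example, not something the server in this post defines:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

// Hypothetical helper: converts the text of a Content-Length value,
// given as a pointer plus a length, into an int.
// Returns -1 if there are no digits, if something other than digits and
// surrounding spaces/tabs is found, or if the number does not fit in an int.
int parse_content_length(const char *value, int len)
{
    assert(value != NULL && len >= 0);

    int i = 0;

    // Skip the optional spaces or tabs that come before the number
    while (i < len && (value[i] == ' ' || value[i] == '\t'))
        i++;

    int first_digit = i;
    int result = 0;
    while (i < len && value[i] >= '0' && value[i] <= '9') {
        int digit = value[i] - '0';
        if (result > (INT_MAX - digit) / 10)
            return -1; // Adding this digit would overflow
        result = result * 10 + digit;
        i++;
    }
    if (i == first_digit)
        return -1; // No digits at all

    // Skip optional trailing spaces or tabs
    while (i < len && (value[i] == ' ' || value[i] == '\t'))
        i++;

    if (i != len)
        return -1; // Unexpected character after the number

    return result;
}
```

For the POST request above, passing the three bytes " 15" gives 15, which is exactly the length of "I'm the payload".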

Parsing a request

Now we start writing the parser. There are many ways one can go about parsing. This is the way I found easiest for HTTP. The basic idea is that we have a pointer to the request bytes, a length, and a cursor. We read the bytes by advancing the cursor and extract information from them by storing it in a structure. When the cursor reaches the end, we have a structure with all the information from the request bytes, easily accessible for our server. The request structure looks like this:

typedef struct { char *data; int size; } string;

#define MAX_HEADERS 32

typedef enum {
    HTTP_METHOD_GET,
    HTTP_METHOD_POST,
} HTTPMethod;

typedef struct {
    string name;
    string value;
} HTTPHeader;

typedef struct {
    int major;
    int minor;
} HTTPVersion;

typedef struct {
    HTTPMethod method;
    string resource_path;
    HTTPVersion version;
    HTTPHeader headers[MAX_HEADERS];
    int num_headers;
} HTTPRequest;

Notice how we created a helper structure for strings. This will come in handy for the entire project! The parsing function will use this interface:

bool parse_request(string src, HTTPRequest *dst)
{
    int cur = 0;

    // .. parsing code goes here ..
}

If we parsed the request successfully, true is returned. If some error occurred, false is returned. When we succeed, the dst argument will be initialized. First, we parse the method. If the request doesn't start with GET or POST, we consider it an error:

if (3 < src.size
    && src.data[0] == 'G'
    && src.data[1] == 'E'
    && src.data[2] == 'T'
    && src.data[3] == ' ') {
    dst->method = HTTP_METHOD_GET;
    cur = 4;
} else if (4 < src.size
    && src.data[0] == 'P'
    && src.data[1] == 'O'
    && src.data[2] == 'S'
    && src.data[3] == 'T'
    && src.data[4] == ' ') {
    dst->method = HTTP_METHOD_POST;
    cur = 5;
} else {
    // Invalid method
    return false;
}

After this block, the cursor points to the character that comes after the first space, which is the first character of the request path. The path spans the characters between the first space and the second.

// Check that there is at least one non-space character where the cursor points
if (cur == src.size || src.data[cur] == ' ')
    return false; // No path

// Save the offset of the path in the string
int path_offset = cur;

// The first character is not a space. Now loop until we find one
do
    cur++;
while (cur < src.size && src.data[cur] != ' ');

// There are two ways we exit the loop:
//   1) The cursor reached the end of the string because no space
//      was found (cur == src.size)
//   2) We found a space (src.data[cur] == ' ')
// Of course (1) is an error

if (cur == src.size)
    return false;

int path_length = cur - path_offset;

// Consume the space that comes after the path
cur++;

dst->resource_path = (string) { .data = src.data + path_offset, .size = path_length };

Instead of creating a copy of the resource path to store in the HTTPRequest structure, we created a string that points inside the input buffer. This trick avoids a dynamic allocation. The downside is that the contents of HTTPRequest are only valid as long as the input buffer stays around. Now we parse the version. We expect one of the following strings: HTTP/1, HTTP/1.0, HTTP/1.1, followed by \r\n:

// If we don't find the string "HTTP/", that's an error
if (4 >= src.size - cur
    || src.data[cur+0] != 'H'
    || src.data[cur+1] != 'T'
    || src.data[cur+2] != 'T'
    || src.data[cur+3] != 'P'
    || src.data[cur+4] != '/')
    return false;
cur += 5;

// Now we expect either "1\r\n", "1.0\r\n", or "1.1\r\n"
if (4 < src.size - cur
    && src.data[cur+0] == '1'
    && src.data[cur+1] == '.'
    && src.data[cur+2] == '1'
    && src.data[cur+3] == '\r'
    && src.data[cur+4] == '\n') {
    cur += 5;
    dst->version = (HTTPVersion) {1, 1};
} else if (4 < src.size - cur
    && src.data[cur+0] == '1'
    && src.data[cur+1] == '.'
    && src.data[cur+2] == '0'
    && src.data[cur+3] == '\r'
    && src.data[cur+4] == '\n') {
    cur += 5;
    dst->version = (HTTPVersion) {1, 0};
} else if (2 < src.size - cur
    && src.data[cur+0] == '1'
    && src.data[cur+1] == '\r'
    && src.data[cur+2] == '\n') {
    cur += 3;
    dst->version = (HTTPVersion) {1, 0};
} else {
    // Invalid version
    return false;
}
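The character-by-character comparisons used so far are easy to follow but get long quickly. One possible way to shorten them is a small helper that checks whether the input continues with a given literal. Both starts_with and match_version are names invented for this sketch, and the code in the rest of this post keeps using the explicit comparisons:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// Hypothetical helper: returns true if the bytes of src starting at
// offset cur begin with the null-terminated literal lit.
bool starts_with(string src, int cur, const char *lit)
{
    assert(cur >= 0 && cur <= src.size);

    int lit_len = (int) strlen(lit);
    if (lit_len > src.size - cur)
        return false; // Not enough bytes left for a match
    return memcmp(src.data + cur, lit, (size_t) lit_len) == 0;
}

// Hypothetical variant of the version check built on starts_with.
// Returns the number of bytes consumed and stores the minor version
// in *minor, or returns 0 if none of the expected strings was found.
int match_version(string src, int cur, int *minor)
{
    if (starts_with(src, cur, "1.1\r\n")) { *minor = 1; return 5; }
    if (starts_with(src, cur, "1.0\r\n")) { *minor = 0; return 5; }
    if (starts_with(src, cur, "1\r\n"))   { *minor = 0; return 3; }
    return 0;
}
```

The bounds check lives in one place, so each call site only has to name the literal it expects. The price is a call to strlen per comparison, which should not matter for inputs this small.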

And that was the first line. Now comes the easy part! The cursor now points to the first character of the list of headers. We must consume headers until we find the final \r\n, which denotes the end of the request head.

// Initialize the header array
dst->num_headers = 0;

// Loop until we find the final \r\n
while (1 >= src.size - cur
    || src.data[cur+0] != '\r'
    || src.data[cur+1] != '\n') {

    // The cursor now points to the first character of the header's name
    int name_offset = cur;

    // Consume characters until we get to the separator
    while (cur < src.size && src.data[cur] != ':')
        cur++;
    if (cur == src.size)
        return false; // Cursor reached the end of the string
    string header_name = { src.data + name_offset, cur - name_offset };
    cur++; // Consume the ':'

    // Now the cursor points to the first character of the header value
    int value_offset = cur;
    while (cur < src.size && src.data[cur] != '\r')
        cur++;
    if (cur == src.size)
        return false; // Didn't find a '\r'
    string header_value = { src.data + value_offset, cur - value_offset };

    // Now we expect \r\n to terminate the header
    if (1 >= src.size - cur
        || src.data[cur+0] != '\r'
        || src.data[cur+1] != '\n')
        return false; // Didn't find \r\n
    cur += 2;

    if (dst->num_headers == MAX_HEADERS)
        return false; // We reached the end of the static array
    dst->headers[dst->num_headers++] = (HTTPHeader) { header_name, header_value };
}

// We exited the loop, so we know there is a final \r\n we must skip
cur += 2;

// Finished
return true;

And there it is! The request parser! It may seem quite daunting, but it's the same trick repeated over and over. Here is the server with the added parser:

#include <stdio.h> // printf
#include <stdbool.h> // bool, true, false

#ifdef _WIN32
#include <winsock2.h>
#define SOCKET_TYPE SOCKET
#define INVALID_SOCKET_VALUE INVALID_SOCKET
#define CLOSE_SOCKET closesocket
#else
#include <unistd.h> // close
#include <arpa/inet.h> // socket, htons, inet_addr, sockaddr_in, bind, listen, accept, recv, send
#define SOCKET_TYPE int
#define INVALID_SOCKET_VALUE -1
#define CLOSE_SOCKET close
#endif

typedef struct { char *data; int size; } string;

#define MAX_HEADERS 32

typedef enum {
    HTTP_METHOD_GET,
    HTTP_METHOD_POST,
} HTTPMethod;

typedef struct {
    string name;
    string value;
} HTTPHeader;

typedef struct {
    int major;
    int minor;
} HTTPVersion;

typedef struct {
    HTTPMethod method;
    string resource_path;
    HTTPVersion version;
    HTTPHeader headers[MAX_HEADERS];
    int num_headers;
} HTTPRequest;

bool parse_request(string src, HTTPRequest *dst)
{
    int cur = 0;

    if (3 < src.size
        && src.data[0] == 'G'
        && src.data[1] == 'E'
        && src.data[2] == 'T'
        && src.data[3] == ' ') {
        dst->method = HTTP_METHOD_GET;
        cur = 4;
    } else if (4 < src.size
        && src.data[0] == 'P'
        && src.data[1] == 'O'
        && src.data[2] == 'S'
        && src.data[3] == 'T'
        && src.data[4] == ' ') {
        dst->method = HTTP_METHOD_POST;
        cur = 5;
    } else {
        // Invalid method
        return false;
    }

    // Check that there is at least one non-space character where the cursor points
    if (cur == src.size || src.data[cur] == ' ')
        return false; // No path

    // Save the offset of the path in the string
    int path_offset = cur;

    // The first character is not a space. Now loop until we find one
    do
        cur++;
    while (cur < src.size && src.data[cur] != ' ');

    // There are two ways we exit the loop:
    //   1) The cursor reached the end of the string because no space
    //      was found (cur == src.size)
    //   2) We found a space (src.data[cur] == ' ')
    // Of course (1) is an error

    if (cur == src.size)
        return false;

    int path_length = cur - path_offset;

    // Consume the space that comes after the path
    cur++;

    dst->resource_path = (string) { .data = src.data + path_offset, .size = path_length };

    // If we don't find the string "HTTP/", that's an error
    if (4 >= src.size - cur
        || src.data[cur+0] != 'H'
        || src.data[cur+1] != 'T'
        || src.data[cur+2] != 'T'
        || src.data[cur+3] != 'P'
        || src.data[cur+4] != '/')
        return false;
    cur += 5;

    // Now we expect either "1\r\n", "1.0\r\n", or "1.1\r\n"
    if (4 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '.'
        && src.data[cur+2] == '1'
        && src.data[cur+3] == '\r'
        && src.data[cur+4] == '\n') {
        cur += 5;
        dst->version = (HTTPVersion) {1, 1};
    } else if (4 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '.'
        && src.data[cur+2] == '0'
        && src.data[cur+3] == '\r'
        && src.data[cur+4] == '\n') {
        cur += 5;
        dst->version = (HTTPVersion) {1, 0};
    } else if (2 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '\r'
        && src.data[cur+2] == '\n') {
        cur += 3;
        dst->version = (HTTPVersion) {1, 0};
    } else {
        // Invalid version
        return false;
    }

    // Initialize the header array
    dst->num_headers = 0;

    // Loop until we find the final \r\n
    while (1 >= src.size - cur
        || src.data[cur+0] != '\r'
        || src.data[cur+1] != '\n') {

        // The cursor now points to the first character of the header's name
        int name_offset = cur;

        // Consume characters until we get to the separator
        while (cur < src.size && src.data[cur] != ':')
            cur++;
        if (cur == src.size)
            return false; // Cursor reached the end of the string
        string header_name = { src.data + name_offset, cur - name_offset };
        cur++; // Consume the ':'

        // Now the cursor points to the first character of the header value
        int value_offset = cur;
        while (cur < src.size && src.data[cur] != '\r')
            cur++;
        if (cur == src.size)
            return false; // Didn't find a '\r'
        string header_value = { src.data + value_offset, cur - value_offset };

        // Now we expect \r\n to terminate the header
        if (1 >= src.size - cur
            || src.data[cur+0] != '\r'
            || src.data[cur+1] != '\n')
            return false; // Didn't find \r\n
        cur += 2;

        if (dst->num_headers == MAX_HEADERS)
            return false; // We reached the end of the static array
        dst->headers[dst->num_headers++] = (HTTPHeader) { header_name, header_value };
    }

    // We exited the loop, so we know there is a final \r\n we must skip
    cur += 2;

    return true;
}

int send_all(SOCKET_TYPE sock, void *src, size_t num)
{
    size_t sent = 0;
    while (sent < num) {
        int just_sent = send(sock, (char*) src + sent, num - sent, 0);
        if (just_sent < 0) return -1;
        sent += (size_t) just_sent;
    }
    return sent;
}

int main()
{
#ifdef _WIN32
    WSADATA wsaData;
    int err = WSAStartup(MAKEWORD(2, 2), &wsaData);
    if (err != 0) {
        printf("WSAStartup failed\n");
        return -1;
    }
#endif

    // Create the listening socket
    SOCKET_TYPE listen_socket = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_socket == INVALID_SOCKET_VALUE) {
        printf("socket failed\n");
        return -1;
    }

    struct sockaddr_in bind_buffer;
    bind_buffer.sin_family = AF_INET;
    bind_buffer.sin_port = htons(8080);
    bind_buffer.sin_addr.s_addr = inet_addr("127.0.0.1");

    if (bind(listen_socket, (struct sockaddr*) &bind_buffer, sizeof(bind_buffer))) {
        printf("bind failed\n");
        return -1;
    }

    if (listen(listen_socket, 32)) {
        printf("listen failed\n");
        return -1;
    }

    while (1) {
        SOCKET_TYPE client_socket = accept(listen_socket, NULL, NULL);
        if (client_socket == INVALID_SOCKET_VALUE) {
            printf("accept failed\n");
            continue;
        }

        char request_buffer[4096];
        int len = recv(client_socket, request_buffer, sizeof(request_buffer), 0);
        if (len < 0) {
            printf("recv failed\n");
            CLOSE_SOCKET(client_socket);
            continue;
        }

        HTTPRequest parsed_request;
        if (!parse_request((string) {request_buffer, len}, &parsed_request)) {
            // Parsing failed
            char response_buffer[] =
                "HTTP/1.0 400 Bad Request\r\n"
                "Content-Length: 0\r\n"
                "\r\n";
            // sizeof counts the null terminator too, which must not be sent
            send_all(client_socket, response_buffer, sizeof(response_buffer) - 1);

        } else {
            // Parsing succeeded
            char response_buffer[] =
                "HTTP/1.0 200 OK\r\n"
                "Content-Length: 13\r\n"
                "Content-Type: text/plain\r\n"
                "\r\n"
                "Hello, world!";
            // sizeof counts the null terminator too, which must not be sent
            send_all(client_socket, response_buffer, sizeof(response_buffer) - 1);
        }
        CLOSE_SOCKET(client_socket);
    }
    // This point will never be reached
}
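Once a request has been parsed, the server will want to look up specific headers such as Content-Length. Here is a minimal sketch of how that could be done on top of the headers array. The helpers string_equals_nocase, trim and find_header are names made up for this example. The lookup ignores letter case because HTTP header names are case-insensitive, and the value is trimmed because the parser above stores everything after the ':' (including the leading space):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Types repeated from the server so that this block compiles on its own
typedef struct { char *data; int size; } string;
typedef struct { string name; string value; } HTTPHeader;

static char to_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
}

// Hypothetical helper: compares a string with a null-terminated
// literal, ignoring the case of ASCII letters.
bool string_equals_nocase(string a, const char *b)
{
    int i = 0;
    while (i < a.size && b[i] != '\0') {
        if (to_lower(a.data[i]) != to_lower(b[i]))
            return false;
        i++;
    }
    // Equal only if both ended at the same time
    return i == a.size && b[i] == '\0';
}

// Hypothetical helper: returns a view of s without leading and
// trailing spaces or tabs. No bytes are copied or modified.
string trim(string s)
{
    while (s.size > 0 && (s.data[0] == ' ' || s.data[0] == '\t')) {
        s.data++;
        s.size--;
    }
    while (s.size > 0 && (s.data[s.size-1] == ' ' || s.data[s.size-1] == '\t'))
        s.size--;
    return s;
}

// Hypothetical helper: looks for the first header called name.
// On success stores the trimmed value in *value and returns true.
bool find_header(HTTPHeader *headers, int num_headers, const char *name, string *value)
{
    assert(headers != NULL && name != NULL && value != NULL);

    for (int i = 0; i < num_headers; i++) {
        if (string_equals_nocase(headers[i].name, name)) {
            *value = trim(headers[i].value);
            return true;
        }
    }
    return false;
}
```

With a parsed request, a call might look like find_header(parsed_request.headers, parsed_request.num_headers, "Content-Length", &value). Like resource_path, the returned string still points inside the request buffer.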

Handle partial reads

As we did with send, we need to make sure our call to recv reads the entire request head. It is possible that we read only part of the request head, in which case we need to call recv again. We can stop when we find \r\n\r\n, which marks the end of the head and the start of the body. If we fill up the buffer before we find such a token, we consider that an error.

int recv_request_head(SOCKET_TYPE sock, char *dst, int max, int *head_len)
{
    int received = 0;
    while (1) {
        int just_received = recv(sock, dst + received, max - received, 0);
        // A negative value is an error. Zero means the peer closed the
        // connection before sending a full head, so we can't continue either.
        if (just_received <= 0) return -1;
        received += just_received;

        // Look for \r\n\r\n
        int i = 0;
        while (3 < received - i
            && (dst[i+0] != '\r'
            || dst[i+1] != '\n'
            || dst[i+2] != '\r'
            || dst[i+3] != '\n'))
            i++;
        if (3 < received - i) {
            // We found the \r\n\r\n and it is at position "i"
            // We consider the head to go from the first byte to the last \n
            *head_len = i + 4;
            break;
        }
        // We did not find the end of the head. If the buffer is now full that's an error
        if (received == max)
            return -1;
    }
    // If we are here we received the head. Note that we may have also read
    // some bytes after the \r\n\r\n, which are part of the request body.
    return received;
}

With this, our main loop changes:

// ... parsing and everything else stays the same ...

int main()
{
    // ... this code stays the same too ...

    while (1) {
        SOCKET_TYPE client_socket = accept(listen_socket, NULL, NULL);
        if (client_socket == INVALID_SOCKET_VALUE) {
            printf("accept failed\n");
            continue;
        }

        char request_buffer[4096];
        int received_total, head_len;
        received_total = recv_request_head(client_socket, request_buffer, sizeof(request_buffer), &head_len);
        if (received_total < 0) {
            printf("recv_request_head failed\n");
            CLOSE_SOCKET(client_socket);
            continue;
        }
        string request_head = {request_buffer, head_len};

        HTTPRequest parsed_request;
        if (!parse_request(request_head, &parsed_request)) {
            // ... unchanged ...
        } else {
            // ... unchanged ...
        }
        CLOSE_SOCKET(client_socket);
    }
    // This point will never be reached
}
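Since recv_request_head may read past the \r\n\r\n, the difference between received_total and head_len tells us how many body bytes are already sitting in the buffer. Here is a small sketch of how that bookkeeping could look. The function body_bytes_missing is hypothetical, and content_length is assumed to be a value we already extracted from the Content-Length header (0 when the header is absent):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { char *data; int size; } string;

// Hypothetical helper: describes the part of the body that arrived
// together with the head and computes how much is still missing.
//   buf            - buffer filled by recv_request_head
//   received_total - value returned by recv_request_head
//   head_len       - head length reported by recv_request_head
//   content_length - assumed to come from the Content-Length header
//   body_prefix    - set to the body bytes already in buf
// Returns the number of body bytes that still have to be received.
int body_bytes_missing(char *buf, int received_total, int head_len,
                       int content_length, string *body_prefix)
{
    assert(buf != NULL && body_prefix != NULL);
    assert(0 <= head_len && head_len <= received_total);
    assert(content_length >= 0);

    // Everything after the head belongs to the body
    int already = received_total - head_len;

    // Don't treat more bytes as body than the client announced
    if (already > content_length)
        already = content_length;

    body_prefix->data = buf + head_len;
    body_prefix->size = already;

    return content_length - already;
}
```

If the result is greater than zero, the remaining bytes would have to be read with further calls to recv before the body is complete.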

Further improvements

This parser will work well most of the time, but there are a couple corner cases we haven't handled: