Building web apps from scratch - Request Parsing - Part 2

NOTE: This post is still a work in progress!! Our little web server is very cool, isn't it? The thing missing now, from a feature standpoint, is sending a response based on what the client requested. But first, we need to improve its robustness a bit. There are some error corner cases that we haven't handled yet.

Handling partial sends

The first issue is related to recv and send. We assumed that when we pass a buffer to send, the system will either send the entire message or fail. This isn't correct: send can also do partial sends. Let's say we want to send 100 bytes. The first call may send only 10 of them, so we need to call send again with the remaining 90 bytes, and keep going until nothing is left:

int send_all(SOCKET_TYPE sock, void *src, size_t num)
{
    size_t sent = 0;
    while (sent < num) {
        int just_sent = send(sock, (char*) src + sent, num - sent, 0);
        if (just_sent < 0) return -1;
        sent += (size_t) just_sent;
    }
    return (int) sent;
}

This function calls send multiple times until every byte has been sent. If an error occurs before all bytes are sent, it returns early with -1. By using this function in place of send we made our server much more reliable! The recv function behaves in a similar way: a single call may return fewer bytes than the client sent, so one recv isn't guaranteed to give us the whole request. But since we are ignoring the incoming message for now, it does not make a difference.
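To see the loop at work without a real socket, here is a small sketch that swaps send for a stand-in. Both fake_send and write_all are hypothetical names made up for this example: fake_send accepts at most 10 bytes per call, which mimics a socket doing partial sends, and write_all is the same loop as send_all built on top of it.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for send(): copies at most 10 bytes per call
// into a global sink, like a socket that only does partial sends.
char sink[256];
size_t sink_used = 0;
int calls = 0;

int fake_send(const char *src, size_t num)
{
    size_t chunk = num < 10 ? num : 10;
    if (chunk > sizeof(sink) - sink_used)
        return -1; // The sink is full
    memcpy(sink + sink_used, src, chunk);
    sink_used += chunk;
    calls++;
    return (int) chunk;
}

// Same loop as send_all, but calling fake_send instead of send
int write_all(const void *src, size_t num)
{
    size_t sent = 0;
    while (sent < num) {
        int just_sent = fake_send((const char*) src + sent, num - sent);
        if (just_sent < 0) return -1;
        sent += (size_t) just_sent;
    }
    return (int) sent;
}
```

Writing a 100 byte message through write_all takes 10 calls to fake_send, and the sink ends up with all 100 bytes in order. A single call to fake_send would have silently dropped the last 90.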

The syntax of an HTTP request

Now we are ready to parse the client's request. This will allow us to return different responses based on what the client sent us. The proper way to go about this would be reading the HTTP specification and implementing the parser accordingly, but starting with an approximation is good enough for now. Here's an example of an HTTP request:

GET / HTTP/1.1
Host: 127.0.0.1:8080
Connection: keep-alive
sec-ch-ua: "Not(A:Brand";v="99", "Brave";v="133", "Chromium";v="133"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Sec-GPC: 1
Accept-Language: en-US,en;q=0.6
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br, zstd

I obtained this by simply adding a print statement in our server after recv (don't forget the null terminator!). The first line contains three things:

method
resource path
HTTP version

The method changes how the request is handled. The most common methods are

GET: The client is asking us for a resource
POST: The client is sending us some data

This is a bit simplistic, but we can focus on it later. You can think of a resource path as the name of the file the client is requesting, though the resource may be something other than a file. For instance, we could have a resource called /users which returns a list of users read from the database. The version we'll support for now is HTTP/1.0, but this does not matter much as versions 1.0 and 1.1 use the same syntax. The first line is terminated by a line break, which is made of a carriage return character followed by a newline character. In C, these characters are written as \r\n. After that, we have a list of headers, strings in the form:

name: value\r\n

We can mostly ignore these and only add support for specific ones as we need them. The one we are interested in is the Content-Length header, which tells us the length in bytes of the request payload. GET requests don't normally have a payload, so they don't use this header. We will need it for POST requests. After the entire header list comes one more \r\n, in addition to the \r\n that closes the last header. If the request has a payload, this is where it starts. Just to give you an example, this is what a POST request looks like:

POST /some-resource HTTP/1.0\r\n
host: 127.0.0.1\r\n
Content-Length: 15\r\n
\r\n
I'm the payload
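To make the layout concrete, here is a minimal sketch that writes the request above as a C string and locates where the payload starts. The name find_body_offset is made up for this example. It simply looks for the first \r\n\r\n, which is the \r\n closing the last header followed by the empty line.

```c
#include <string.h>

// The POST request from above, written as a C string literal
const char post_request[] =
    "POST /some-resource HTTP/1.0\r\n"
    "host: 127.0.0.1\r\n"
    "Content-Length: 15\r\n"
    "\r\n"
    "I'm the payload";

// Hypothetical helper: returns the offset of the first payload byte,
// or -1 if the "\r\n\r\n" that ends the head isn't in the buffer.
int find_body_offset(const char *req, int len)
{
    for (int i = 0; i + 3 < len; i++)
        if (memcmp(req + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    return -1;
}
```

Everything before the returned offset is the request head, and the number of bytes after it matches the value of Content-Length.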

Parsing a request

Now we start writing the parser. There are many ways one can go about parsing. This is the way I found easiest for HTTP. The basic idea is that we have a pointer to the request bytes, a length, and a cursor. We read the bytes by advancing the cursor and store the information we extract in a structure. When the cursor reaches the end, we have a structure with all the information from the request bytes, easily accessible for our server. The request structure looks like this:

typedef struct { char *data; int size; } string;

#define MAX_HEADERS 32

typedef enum {
    HTTP_METHOD_GET,
    HTTP_METHOD_POST,
} HTTPMethod;

typedef struct {
    string name;
    string value;
} HTTPHeader;

typedef struct {
    int major;
    int minor;
} HTTPVersion;

typedef struct {
    HTTPMethod method;
    string resource_path;
    HTTPVersion version;
    HTTPHeader headers[MAX_HEADERS];
    int num_headers;
} HTTPRequest;

Notice how we created a helper structure for strings. This will come in handy for the entire project! The parsing function will use this interface:
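Since these strings carry a length instead of a null terminator, strcmp and friends don't work on them directly. A couple of small helpers go a long way. The sketch below is only a suggestion: the STR macro and string_eq are names invented for this example.

```c
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// Hypothetical helper: wrap a string literal, not counting its null terminator
#define STR(lit) ((string) { (char*) (lit), (int) sizeof(lit) - 1 })

// Hypothetical helper: two strings are equal if they have the same
// length and the same bytes
bool string_eq(string a, string b)
{
    return a.size == b.size
        && memcmp(a.data, b.data, (size_t) a.size) == 0;
}
```

With these, checking a path reads like string_eq(path, STR("/index.html")), with no copy and no null terminator needed.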

bool parse_request(string src, HTTPRequest *dst)
{
    int cur = 0;

    // .. parsing code goes here ..
}

If we parsed the request successfully, true is returned. If some error occurred, false is returned. When we succeed, the dst argument will be initialized. First, we parse the method. If the request doesn't start with GET or POST, we consider it an error:

if (3 < src.size
    && src.data[0] == 'G'
    && src.data[1] == 'E'
    && src.data[2] == 'T'
    && src.data[3] == ' ') {
    dst->method = HTTP_METHOD_GET;
    cur = 4;
} else if (4 < src.size
    && src.data[0] == 'P'
    && src.data[1] == 'O'
    && src.data[2] == 'S'
    && src.data[3] == 'T'
    && src.data[4] == ' ') {
    dst->method = HTTP_METHOD_POST;
    cur = 5;
} else {
    // Invalid method
    return false;
}
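Comparing one character at a time is explicit but gets long quickly. As an alternative, the same check could be wrapped in a helper. This is just a sketch, and consume_prefix is a name made up here, not something the rest of the post relies on.

```c
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// Hypothetical helper: if src contains `prefix` at offset *cur, advance
// the cursor past it and return true. Otherwise leave *cur untouched.
bool consume_prefix(string src, int *cur, const char *prefix)
{
    int n = (int) strlen(prefix);
    if (n > src.size - *cur)
        return false; // Not enough bytes left
    if (memcmp(src.data + *cur, prefix, (size_t) n) != 0)
        return false; // Bytes don't match
    *cur += n;
    return true;
}
```

The method check would then become a chain of calls like consume_prefix(src, &cur, "GET ") and consume_prefix(src, &cur, "POST "), with the bounds check done in a single place.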

After this block, the cursor points to the character that comes after the first space, which is the first character of the request path. The path goes from there up to the second space.

// Check that there is at least one non-space character where the cursor points
if (cur == src.size || src.data[cur] == ' ')
    return false; // No path

// Save the offset of the path in the string
int path_offset = cur;

// The first character is not a space. Now loop until we find one
do
    cur++;
while (cur < src.size && src.data[cur] != ' ');

// There are two ways we exit the loop:
//   1) The cursor reached the end of the string because no space
//      was found (cur == src.size)
//   2) We found a space (src.data[cur] == ' ')
// Of course (1) is an error

if (cur == src.size)
    return false;

int path_length = cur - path_offset;

// Consume the space that comes after the path
cur++;

dst->resource_path = (string) { .data = src.data + path_offset, .size = path_length };

Instead of creating a copy of the resource path to store in the HTTPRequest structure, we created a string that points inside the input buffer. With this trick we avoid a dynamic allocation. The downside is that the contents of HTTPRequest now depend on the input buffer staying around. Now we parse the version. We expect one of the following strings: HTTP/1, HTTP/1.0, HTTP/1.1, followed by \r\n:
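If some part of the server needs the path to outlive the input buffer, or needs a null-terminated C string, it can make a copy at that point. Here is a minimal sketch, where string_copy is a hypothetical helper and not part of the parser:

```c
#include <stdbool.h>
#include <string.h>

typedef struct { char *data; int size; } string;

// Hypothetical helper: copy a string into dst and null-terminate it.
// Returns false if dst (which holds cap bytes) is too small.
bool string_copy(string src, char *dst, int cap)
{
    if (src.size >= cap)
        return false; // No room for the bytes plus the terminator
    memcpy(dst, src.data, (size_t) src.size);
    dst[src.size] = '\0';
    return true;
}
```

For printing alone no copy is needed, since printf can take the length explicitly: printf("%.*s\n", path.size, path.data).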

// If we don't find the string "HTTP/", that's an error
if (4 >= src.size - cur
    || src.data[cur+0] != 'H'
    || src.data[cur+1] != 'T'
    || src.data[cur+2] != 'T'
    || src.data[cur+3] != 'P'
    || src.data[cur+4] != '/')
    return false;
cur += 5;

// Now we expect either "1\r\n", "1.0\r\n", or "1.1\r\n"
if (4 < src.size - cur
    && src.data[cur+0] == '1'
    && src.data[cur+1] == '.'
    && src.data[cur+2] == '1'
    && src.data[cur+3] == '\r'
    && src.data[cur+4] == '\n') {
    cur += 5;
    dst->version = (HTTPVersion) {1, 1};
} else if (4 < src.size - cur
    && src.data[cur+0] == '1'
    && src.data[cur+1] == '.'
    && src.data[cur+2] == '0'
    && src.data[cur+3] == '\r'
    && src.data[cur+4] == '\n') {
    cur += 5;
    dst->version = (HTTPVersion) {1, 0};
} else if (2 < src.size - cur
    && src.data[cur+0] == '1'
    && src.data[cur+1] == '\r'
    && src.data[cur+2] == '\n') {
    cur += 3;
    dst->version = (HTTPVersion) {1, 0};
} else {
    // Invalid version
    return false;
}

And that was the first line. Now comes the easy part! Now the cursor points to the first character of the list of headers. We must consume headers until we find the final \r\n which denotes the end of the request head.

// Initialize the header array
dst->num_headers = 0;

// Loop until we find the final \r\n
while (1 >= src.size - cur
    || src.data[cur+0] != '\r'
    || src.data[cur+1] != '\n') {

    // The cursor now points to the first character of the header's name
    int name_offset = cur;

    // Consume characters until we get to the separator
    while (cur < src.size && src.data[cur] != ':')
        cur++;
    if (cur == src.size)
        return false; // Cursor reached the end of the string
    string header_name = { src.data + name_offset, cur - name_offset };
    cur++; // Consume the ':'

    // Now the cursor points to the first character of the header value
    int value_offset = cur;
    while (cur < src.size && src.data[cur] != '\r')
        cur++;
    if (cur == src.size)
        return false; // Didn't find a '\r'
    string header_value = { src.data + value_offset, cur - value_offset };

    // Now we expect \r\n to terminate the header
    if (1 >= src.size - cur
        || src.data[cur+0] != '\r'
        || src.data[cur+1] != '\n')
        return false; // Didn't find \r\n
    cur += 2;

    if (dst->num_headers == MAX_HEADERS)
        return false; // We reached the end of the static array
    dst->headers[dst->num_headers++] = (HTTPHeader) { header_name, header_value };
}

// We exited the loop, so we know there is a final \r\n we must skip
cur += 2;

// Finished
return true;

And there it is! The request parser! It may seem quite daunting, but it's the same trick repeated over and over. Here is the server with the added parser:

#include <stdio.h> // printf
#include <stdbool.h> // bool, true, false

#ifdef _WIN32
#include <winsock2.h>
#define SOCKET_TYPE SOCKET
#define INVALID_SOCKET_VALUE INVALID_SOCKET
#define CLOSE_SOCKET closesocket
#else
#include <unistd.h> // close
#include <arpa/inet.h> // socket, htons, inet_addr, sockaddr_in, bind, listen, accept, recv, send
#define SOCKET_TYPE int
#define INVALID_SOCKET_VALUE -1
#define CLOSE_SOCKET close
#endif

typedef struct { char *data; int size; } string;

#define MAX_HEADERS 32

typedef enum {
    HTTP_METHOD_GET,
    HTTP_METHOD_POST,
} HTTPMethod;

typedef struct {
    string name;
    string value;
} HTTPHeader;

typedef struct {
    int major;
    int minor;
} HTTPVersion;

typedef struct {
    HTTPMethod method;
    string resource_path;
    HTTPVersion version;
    HTTPHeader headers[MAX_HEADERS];
    int num_headers;
} HTTPRequest;

bool parse_request(string src, HTTPRequest *dst)
{
    int cur = 0;

    if (3 < src.size
        && src.data[0] == 'G'
        && src.data[1] == 'E'
        && src.data[2] == 'T'
        && src.data[3] == ' ') {
        dst->method = HTTP_METHOD_GET;
        cur = 4;
    } else if (4 < src.size
        && src.data[0] == 'P'
        && src.data[1] == 'O'
        && src.data[2] == 'S'
        && src.data[3] == 'T'
        && src.data[4] == ' ') {
        dst->method = HTTP_METHOD_POST;
        cur = 5;
    } else {
        // Invalid method
        return false;
    }

    // Check that there is at least one non-space character where the cursor points
    if (cur == src.size || src.data[cur] == ' ')
        return false; // No path

    // Save the offset of the path in the string
    int path_offset = cur;

    // The first character is not a space. Now loop until we find one
    do
        cur++;
    while (cur < src.size && src.data[cur] != ' ');

    // There are two ways we exit the loop:
    //   1) The cursor reached the end of the string because no space
    //      was found (cur == src.size)
    //   2) We found a space (src.data[cur] == ' ')
    // Of course (1) is an error

    if (cur == src.size)
        return false;

    int path_length = cur - path_offset;

    // Consume the space that comes after the path
    cur++;

    dst->resource_path = (string) { .data = src.data + path_offset, .size = path_length };

    // If we don't find the string "HTTP/", that's an error
    if (4 >= src.size - cur
        || src.data[cur+0] != 'H'
        || src.data[cur+1] != 'T'
        || src.data[cur+2] != 'T'
        || src.data[cur+3] != 'P'
        || src.data[cur+4] != '/')
        return false;
    cur += 5;

    // Now we expect either "1\r\n", "1.0\r\n", or "1.1\r\n"
    if (4 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '.'
        && src.data[cur+2] == '1'
        && src.data[cur+3] == '\r'
        && src.data[cur+4] == '\n') {
        cur += 5;
        dst->version = (HTTPVersion) {1, 1};
    } else if (4 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '.'
        && src.data[cur+2] == '0'
        && src.data[cur+3] == '\r'
        && src.data[cur+4] == '\n') {
        cur += 5;
        dst->version = (HTTPVersion) {1, 0};
    } else if (2 < src.size - cur
        && src.data[cur+0] == '1'
        && src.data[cur+1] == '\r'
        && src.data[cur+2] == '\n') {
        cur += 3;
        dst->version = (HTTPVersion) {1, 0};
    } else {
        // Invalid version
        return false;
    }

    // Initialize the header array
    dst->num_headers = 0;

    // Loop until we find the final \r\n
    while (1 >= src.size - cur
        || src.data[cur+0] != '\r'
        || src.data[cur+1] != '\n') {

        // The cursor now points to the first character of the header's name
        int name_offset = cur;

        // Consume characters until we get to the separator
        while (cur < src.size && src.data[cur] != ':')
            cur++;
        if (cur == src.size)
            return false; // Cursor reached the end of the string
        string header_name = { src.data + name_offset, cur - name_offset };
        cur++; // Consume the ':'

        // Now the cursor points to the first character of the header value
        int value_offset = cur;
        while (cur < src.size && src.data[cur] != '\r')
            cur++;
        if (cur == src.size)
            return false; // Didn't find a '\r'
        string header_value = { src.data + value_offset, cur - value_offset };

        // Now we expect \r\n to terminate the header
        if (1 >= src.size - cur
            || src.data[cur+0] != '\r'
            || src.data[cur+1] != '\n')
            return false; // Didn't find \r\n
        cur += 2;

        if (dst->num_headers == MAX_HEADERS)
            return false; // We reached the end of the static array
        dst->headers[dst->num_headers++] = (HTTPHeader) { header_name, header_value };
    }

    // We exited the loop, so we know there is a final \r\n we must skip
    cur += 2;

    return true;
}

int send_all(SOCKET_TYPE sock, void *src, size_t num)
{
    size_t sent = 0;
    while (sent < num) {
        int just_sent = send(sock, (char*) src + sent, num - sent, 0);
        if (just_sent < 0) return -1;
        sent += (size_t) just_sent;
    }
    return (int) sent;
}

int main()
{
#ifdef _WIN32
    WSADATA wsaData;
    int err = WSAStartup(MAKEWORD(2, 2), &wsaData);
    if (err != 0) {
        printf("WSAStartup failed\n");
        return -1;
    }
#endif

    // Create the listening socket
    SOCKET_TYPE listen_socket = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_socket == INVALID_SOCKET_VALUE) {
        printf("socket failed\n");
        return -1;
    }

    struct sockaddr_in bind_buffer;
    bind_buffer.sin_family = AF_INET;
    bind_buffer.sin_port = htons(8080);
    bind_buffer.sin_addr.s_addr = inet_addr("127.0.0.1");

    if (bind(listen_socket, (struct sockaddr*) &bind_buffer, sizeof(bind_buffer))) {
        printf("bind failed\n");
        return -1;
    }

    if (listen(listen_socket, 32)) {
        printf("listen failed\n");
        return -1;
    }

    while (1) {
        SOCKET_TYPE client_socket = accept(listen_socket, NULL, NULL);
        if (client_socket == INVALID_SOCKET_VALUE) {
            printf("accept failed\n");
            continue;
        }

        char request_buffer[4096];
        int len = recv(client_socket, request_buffer, sizeof(request_buffer), 0);
        if (len < 0) {
            printf("recv failed\n");
            CLOSE_SOCKET(client_socket);
            continue;
        }

        HTTPRequest parsed_request;
        if (!parse_request((string) {request_buffer, len}, &parsed_request)) {
            // Parsing failed
            char response_buffer[] =
                "HTTP/1.0 400 Bad Request\r\n"
                "Content-Length: 0\r\n"
                "\r\n";
            // sizeof counts the null terminator too, which must not be sent
            send_all(client_socket, response_buffer, sizeof(response_buffer) - 1);

        } else {
            // Parsing succeeded
            char response_buffer[] =
                "HTTP/1.0 200 OK\r\n"
                "Content-Length: 13\r\n"
                "Content-Type: text/plain\r\n"
                "\r\n"
                "Hello, world!";
            // sizeof counts the null terminator too, which must not be sent
            send_all(client_socket, response_buffer, sizeof(response_buffer) - 1);
        }
        CLOSE_SOCKET(client_socket);
    }
    // This point will never be reached
}
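Once a request is parsed, the headers sit in a plain array, so finding one is a linear search. Below is a minimal sketch of how the Content-Length value could be read. The names name_matches and get_content_length are made up for this example. It assumes header names should be compared ignoring case, and it skips the space that our parser leaves at the start of each value.

```c
#include <stdbool.h>
#include <ctype.h>
#include <limits.h>

typedef struct { char *data; int size; } string;
typedef struct { string name; string value; } HTTPHeader;

// Hypothetical helper: case-insensitive comparison against a C string
bool name_matches(string name, const char *wanted)
{
    int i = 0;
    while (i < name.size && wanted[i] != '\0') {
        if (tolower((unsigned char) name.data[i])
            != tolower((unsigned char) wanted[i]))
            return false;
        i++;
    }
    return i == name.size && wanted[i] == '\0';
}

// Hypothetical helper: find Content-Length and parse its value.
// Returns -1 if the header is missing or isn't a plain decimal number.
int get_content_length(const HTTPHeader *headers, int num_headers)
{
    for (int i = 0; i < num_headers; i++) {
        if (!name_matches(headers[i].name, "content-length"))
            continue;
        string v = headers[i].value;
        int j = 0;
        // Skip the whitespace between ':' and the number
        while (j < v.size && (v.data[j] == ' ' || v.data[j] == '\t'))
            j++;
        if (j == v.size || !isdigit((unsigned char) v.data[j]))
            return -1;
        int n = 0;
        while (j < v.size && isdigit((unsigned char) v.data[j])) {
            if (n > (INT_MAX - 9) / 10)
                return -1; // The number doesn't fit in an int
            n = n * 10 + (v.data[j] - '0');
            j++;
        }
        // Only trailing whitespace may follow the digits
        while (j < v.size && (v.data[j] == ' ' || v.data[j] == '\t'))
            j++;
        return j == v.size ? n : -1;
    }
    return -1;
}
```

In the server this would be called as get_content_length(parsed_request.headers, parsed_request.num_headers) after parse_request succeeds.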

Handling partial reads

As we did with send, we need to make sure our calls to recv read the entire request head. A single call may read only part of it, in which case we need to call recv again. We can stop when we find \r\n\r\n, which marks the end of the head and the start of the body. If we fill up the buffer before finding it, we consider that an error.

int recv_request_head(SOCKET_TYPE sock, char *dst, int max, int *head_len)
{
    int received = 0;
    while (1) {
        int just_received = recv(sock, dst + received, max - received, 0);
        // A negative value is an error. Zero means the client closed the
        // connection, so the rest of the head will never arrive.
        if (just_received <= 0) return -1;
        received += just_received;

        // Look for \r\n\r\n
        int i = 0;
        while (3 < received - i
            && (dst[i+0] != '\r'
            || dst[i+1] != '\n'
            || dst[i+2] != '\r'
            || dst[i+3] != '\n'))
            i++;
        if (3 < received - i) {
            // We found the \r\n\r\n and it is at position "i"
            // We consider the head to go from the first byte to the last \n
            *head_len = i + 4;
            break;
        }
        // We did not find the end of the head. If the buffer is now full that's an error
        if (received == max)
            return -1;
    }
    // If we are here we received the head. Note that we may have also read
    // some bytes after the \r\n\r\n, which are part of the request body.
    return received;
}

With this, our main loop changes:

// ... parsing and everything else stays the same ...

int main()
{
    // ... this code stays the same too ...

    while (1) {
        SOCKET_TYPE client_socket = accept(listen_socket, NULL, NULL);
        if (client_socket == INVALID_SOCKET_VALUE) {
            printf("accept failed\n");
            continue;
        }

        char request_buffer[4096];
        int received_total, head_len;
        received_total = recv_request_head(client_socket, request_buffer, sizeof(request_buffer), &head_len);
        if (received_total < 0) {
            printf("recv_request_head failed\n");
            CLOSE_SOCKET(client_socket);
            continue;
        }
        string request_head = {request_buffer, head_len};

        HTTPRequest parsed_request;
        if (!parse_request(request_head, &parsed_request)) {
            // ... unchanged ...
        } else {
            // ... unchanged ...
        }
        CLOSE_SOCKET(client_socket);
    }
    // This point will never be reached
}
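Since recv_request_head may have read past the \r\n\r\n, the two numbers it gives us tell how much of the body is already in the buffer. Once we start reading POST payloads, the bookkeeping could look like the sketch below. The function body_bytes_missing is a hypothetical name, and content_length is assumed to come from the Content-Length header.

```c
// Hypothetical helper: given the total bytes received, the length of the
// head, and the Content-Length value, return how many body bytes still
// have to be read from the socket. Returns 0 if the body is complete.
int body_bytes_missing(int received_total, int head_len, int content_length)
{
    // Everything after the head is body that arrived early
    int body_received = received_total - head_len;
    if (body_received >= content_length)
        return 0;
    return content_length - body_received;
}
```

For instance, with an 80 byte head, 85 bytes received and a Content-Length of 15, 5 bytes of the body are already in the buffer and 10 more need to be received.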

Further improvements

This parser will work well most of the time, but there are a couple of corner cases we haven't handled:

We are blindly accepting characters while parsing the headers and path. These strings have specific syntaxes and allowed characters.
There is something called a "folded header", which is a header spanning multiple lines. We don't need to handle them, but we should at least recognize and reject them.
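Both checks are small. As a rough sketch, with is_token_char and is_folded_line being names invented for this example: header names are made of "token" characters (letters, digits, and a few symbols), and a folded header shows up as a line that starts with a space or a tab.

```c
#include <stdbool.h>
#include <ctype.h>
#include <string.h>

// Hypothetical helper: true if c is allowed in a header name
bool is_token_char(char c)
{
    if (c == '\0')
        return false; // strchr would otherwise match the terminator
    return isalnum((unsigned char) c)
        || strchr("!#$%&'*+-.^_`|~", c) != NULL;
}

// Hypothetical helper: `cur` is the offset of the first character of a
// header line. If that character is a space or tab, the line continues
// the previous header (a folded header).
bool is_folded_line(const char *data, int size, int cur)
{
    return cur < size && (data[cur] == ' ' || data[cur] == '\t');
}
```

In the header loop, is_folded_line could be checked right where name_offset is saved, returning false from parse_request when it matches, and is_token_char could be applied to each character of the name while looking for the ':'.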