Write Robust Code With Soft State

While programming, I've noticed that the variables I use to hold state often feel redundant, since some variable values can be inferred from others. I began thinking of this as "redundancy" and developed an intuition for distinguishing between my program's "core state" and "redundant state".

Over time, I developed an approach to handle this redundancy that allows for more robust programs. I thought I'd share my intuitions here.

Programs Have Redundant State

First of all, let me show what I mean by redundant state.

Consider this code:

1	char buffer[100];
2	int used = 0;
3	int unused = 100;
4

The two variables count how many used and unused bytes are in the buffer. The buffer size is fixed, so the sum of the variable is always 100. This means each variable can be inferred from the other. For this reason I would call these variables "redundant" as you only need one. I would define as "core state" that which can't be inferred from other state. The core state in this example is the buffer and one of the two variables.

Now consider this code:

1	typedef struct Node Node;
2	struct Node {
3	Node *next;
4	int value;
5	};
6
7	Node *list;
8	int count;
9

It has a plain old linked list and a counter for the number of nodes. Here, the counter variable is redundant as it can be calculated by iterating over the nodes of the list.

We could define redundant state as that which can be computed from the remaining state, which I'm calling core state.

The Function of Redundancy

I remember asking myself after realizing this: what would happen if I wrote programs with no redundant state at all?

I would expect very inefficient programs, but also very robust ones.

When you add redundant state to your codebase, you need to add logic to make sure it's always synchronized with the rest of the data. For instance, in the list scenario you need to increment and decrement the counter when the list is modified. Often we forget to add such logic, causing the system to break. If the derived state weren't there, this type of mistake would not be possible.

On the other hand, the reason we add redundancy is to avoid unnecessary computations. It's true that we can infer the number of nodes of a list by iterating over it, but a counter variable would avoid that entirely at the cost of incrementing and decrementing the counter when we modify the list. Of course there is the chance we forget to do that, but it's a good trade-off overall.

Hard & Soft State

I introduced the concepts of "core state" versus "redundant state", but there is another characterization of state that comes in handy here: the difference between soft and hard state.

I would define hard state as the data necessary for the program to run, and soft state as that which is not necessary but improves performance. The prime example of soft state is caches, which are the unsung heroes of software architecture.

Going back to our initial discussion, both the core and derived state in our program are hard state. The redundant state is assumed to always be up to date with the core state. If not, the program is broken.

But there is no reason why the redundant state cannot be treated as soft state. It should be possible to design our program to use redundant state but not require it. This would allow us to use the redundancy when convenient, but then have the option to throw it away when hard to manage.

Let me show an example.

Example

We are implementing a small social network where users can post articles and comment on them.

1	typedef char Username[60];
2	typedef char Content[100];
3
4	typedef struct {
5	int id;
6	Username author;
7	Content content;
8	int upvotes;
9	int downvotes;
10	} Comment;
11
12	typedef struct {
13	Username author;
14	Content content;
15	int num_comments;
16	Comment comments[100];
17	int upvotes;
18	int downvotes;
19	} Post;
20
21	int next_comment_id = 0;
22	Post posts[100];
23
24	void upvote_post(Post *post)
25	{
26	post->upvotes++;
27	}
28
29	void downvote_post(Post *post)
30	{
31	post->downvotes++;
32	}
33
34	int add_comment(Post *post, Username author, Content content)
35	{
36	Comment *comment = &post->comments[post->num_comments++];
37	int id = init_comment(comment, author, content);
38	return id;
39	}
40
41	void remove_comment(Post *post, int comment_id)
42	{
43	int i = comment_by_id(post, comment_id);
44	post->comments[i] = post->comments[--post->num_comments];
45	}
46
47	void upvote_comment(Post *post, int comment_id)
48	{
49	int i = comment_by_id(post, comment_id);
50	post->comments[i].upvotes++;
51	}
52
53	void downvote_comment(Post *post, int comment_id)
54	{
55	int i = comment_by_id(post, comment_id);
56	post->comments[i].downvotes++;
57	}
58

We later decide that instead of showing posts in chronological order, we want to score them and order them by their scores. Let's say this is the score function we settle upon:

post_score = (post_upvotes - post_downvotes) + 4 * comment_count + 2 * sum { (comment_upvotes[i] - comment_downvotes[i]) over i }

It involves a number of parameters, like upvotes, downvotes, comments.

Hard State Solution

The hard state solution would be to add the score variable to our state and make sure that whenever one of the parameters it depends on changes, the score changes accordingly:

1	typedef char Username[60];
2	typedef char Content[100];
3
4	typedef struct {
5	// ...
6	} Comment;
7
8	typedef struct {
9	// ...
10
11	// New field
12	int score;
13	} Post;
14
15	int next_comment_id = 0;
16	Post posts[100];
17
18	void upvote_post(Post *post)
19	{
20	// ...
21
22	// Update score
23	post->score++;
24	}
25
26	void downvote_post(Post *post)
27	{
28	// ...
29
30	// Update score
31	post->score--;
32	}
33
34	int add_comment(Post *post, Username author, Content content)
35	{
36	// ...
37
38	// Update score
39	post->score += 4;
40
41	return id;
42	}
43
44	void remove_comment(Post *post, int comment_id)
45	{
46	// ...
47
48	// Update score
49	post->score -= 4;
50	post->score -= 2 * post->comments[i].upvotes;
51	post->score += 2 * post->comments[i].downvotes;
52	}
53
54	void upvote_comment(Post *post, int comment_id)
55	{
56	// ...
57
58	// Update score
59	post->score += 2;
60	}
61
62	void downvote_comment(Post *post, int comment_id)
63	{
64	// ...
65
66	// Update score
67	post->score -= 2;
68	}
69
70	int get_score(Post *post)
71	{
72	return post->score;
73	}
74

If we forget to update the score or do it incorrectly, our program is essentially broken. If this were not an example, it would be even worse given the score function is something that changes often. I don't know about you, but this is nightmare fuel to me!

Hard State But No Redundancy Solution

Now let's see what happens when we remove the redundant state entirely. Instead of keeping track of the score, we calculate it on the fly.

Our resultant program will be the same as the first version with no scoring, with the addition of this function:

1	int get_score(Post *post)
2	{
3	int score = post->upvotes - post->downvotes + 4 * post->num_comments;
4
5	for (int i = 0; i < post->num_comments; i++) {
6	score += 2 * post->comments[i].upvotes;
7	score -= 2 * post->comments[i].downvotes;
8	}
9
10	return score;
11	}
12

Now all computation related to the score is in one function.

Soft State Solution

Now let's see the soft state solution. This involves storing the score, but allowing the program to work without it.

1	typedef struct {
2	// ...
3	int cached_score_value;
4	} Post;
5
6	int next_comment_id = 0;
7	Post posts[100];
8
9	void upvote_post(Post *post)
10	{
11	// ...
12
13	post->cached_score_value = -1;
14	}
15
16	void downvote_post(Post *post)
17	{
18	// ...
19
20	post->cached_score_value = -1;
21	}
22
23	int add_comment(Post *post, Username author, Content content)
24	{
25	// ...
26
27	post->cached_score_value = -1;
28	return id;
29	}
30
31	void remove_comment(Post *post, int comment_id)
32	{
33	// ...
34
35	post->cached_score_value = -1;
36	}
37
38	void upvote_comment(Post *post, int comment_id)
39	{
40	// ...
41
42	post->cached_score_value = -1;
43	}
44
45	void downvote_comment(Post *post, int comment_id)
46	{
47	// ...
48
49	post->cached_score_value = -1;
50	}
51
52	int get_score(Post *post)
53	{
54	if (post->cached_score_value == -1) {
55
56	int score = post->upvotes - post->downvotes + 4 * post->num_comments;
57
58	for (int i = 0; i < post->num_comments; i++) {
59	score += 2 * post->comments[i].upvotes;
60	score -= 2 * post->comments[i].downvotes;
61	}
62
63	if (score == -1)
64	score = -2;
65
66	post->cached_score_value = score;
67	}
68
69	return post->cached_score_value;
70	}
71

The cached_score_value variable holds the current score of the post or -1, which means the score is uncomputed.

Regardless of whether the score is computed or not, we can call get_score which will return the current score and cache it for next time.

All the places where the score would change just cancel the current score, causing it to be recomputed from scratch next time. This completely removes the cost of having to mirror the changes to the core state (the post and comments) to the redundant state (the score).

Hybrid Approach

The hard state method optimizes for performance by avoiding the cost of computing the score as much as possible. The soft state technique pays the cost of extra computation to reduce complexity.

But the nice thing about this strategy is that you don't have to choose between performance and robustness: you can have an hybrid approach!

If keeping the redundant state up to date is easy, you can do that. No need to invalidate it unnecessarily. On the other hand if the derived state would be tricky to update, just invalidate it.

In conclusion, the soft state approach offers you a way to opt out of complex operations at the cost of extra computation.