|
@@ -0,0 +1,308 @@
|
|
1
|
+---
|
|
2
|
+date: 2018-12-09
|
|
3
|
+title: Over-the-top optimisations with Nim
|
|
4
|
+url: /2018/12/09/over-the-top-optimisations-in-nim/
|
|
5
|
+image: /res/images/aoc/advent-of-code.jpg
|
|
6
|
+description: Sometimes it's fun to just abandon good practice and make something *fast*.
|
|
7
|
+---
|
|
8
|
+
|
|
9
|
+For the past few years I've been taking part in
|
|
10
|
+[Eric Wastl](https://twitter.com/ericwastl)'s
|
|
11
|
+[Advent of Code](https://adventofcode.com/), a coding challenge that provides
|
|
12
|
+a 2-part problem each day from the 1st of December through to Christmas Day.
|
|
13
|
+The puzzles are always interesting — especially as they get progressively
|
|
14
|
+harder — and there's an awesome community of folks that share their solutions
|
|
15
|
+in a huge variety of languages.
|
|
16
|
+
|
|
17
|
+To up the ante somewhat, [Shane](https://blog.dataforce.org.uk/) and I usually
|
|
18
|
+have a little informal competition to see who can write the most performant
|
|
19
|
+code. This year, though, Shane went massively overboard and wrote an entire
|
|
20
|
+[benchmarking suite and webapp](https://blog.dataforce.org.uk/2018/08/advent-of-code-benchmarking/)
|
|
21
|
+to measure our performance, which I took as an invitation and personal
|
|
22
|
+challenge to try to beat him every single day.
|
|
23
|
+
|
|
24
|
+For the past three years I'd used Python exclusively, as its vast standard
|
|
25
|
+library and awesome syntax lead to quick and elegant solutions. Unfortunately
|
|
26
|
+it stands no chance, at least on the earlier puzzles, of beating the speed
|
|
27
|
+of Shane's preferred language of PHP. For a while I consoled myself with the
|
|
28
|
+notion that once the challenges get more complicated I'd be in with a shot,
|
|
29
|
+but after the third or fourth time that Shane's solution finished before
|
|
30
|
+the Python interpreter even started[^1] I decided I'd have to jump ship. I
|
|
31
|
+started using Nim.
|
|
32
|
+
|
|
33
|
+<!--more-->
|
|
34
|
+
|
|
35
|
+### Introducing Nim
|
|
36
|
+
|
|
37
|
+[Nim](https://nim-lang.org/), formerly Nimrod, is a compiled language that
|
|
38
|
+takes a lot of cues from Python. It has a very nice and familiar syntax,
|
|
39
|
+a reasonable standard library, and it's *fast*. I'd thought about learning
|
|
40
|
+it before but didn't really have anything suitable to use it on, until now.
|
|
41
|
+
|
|
42
|
+The code I used for my day one part one answer looks like this in Nim:
|
|
43
|
+
|
|
44
|
+{{< highlight nim >}}
|
|
45
|
+import math, sequtils, strutils
|
|
46
|
+
|
|
47
|
+echo readFile("data/01.txt").strip.splitLines.map(parseInt).sum
|
|
48
|
+{{< / highlight >}}
|
|
49
|
+
|
|
50
|
+It's a one-liner that Python would be proud of. The difference with Nim,
|
|
51
|
+though, is that this compiles down to C, and from there you get all
|
|
52
|
+the benefits of an optimising C compiler and linker. You end up with
|
|
53
|
+a blazingly fast stand-alone binary.
|
|
54
|
+
|
|
55
|
+### Losing my marbles
|
|
56
|
+
|
|
57
|
+[Day 9](https://adventofcode.com/2018/day/9) of this year's Advent of
|
|
58
|
+Code proved interesting to optimise, and I'm going to walk through some
|
|
59
|
+of the steps I took and their impact. I'm in no way a Nim expert and
|
|
60
|
+this is for a program that will be run once and then thrown away, so
|
|
61
|
+please don't take this too much to heart.
|
|
62
|
+
|
|
63
|
+Day 9 presents a marble game played by Santa's elves, whereby marbles
|
|
64
|
+with increasing values are added to a circle according to certain
|
|
65
|
+rules; every 23rd marble is special and the elf playing it gets to
|
|
66
|
+keep that one and also pick up a marble a certain number of places
|
|
67
|
+away. The winner is the elf with the highest total marble value at the end.
|
|
68
|
+It doesn't sound like a particularly thrilling game, but as far as
|
|
69
|
+I can tell there's no way to easily predict the winner without
|
|
70
|
+simulating it step-by-step so it makes for an interesting problem.
|
|
71
|
+
|
|
72
|
+### Naive solution: over 10 minutes
|
|
73
|
+
|
|
74
|
+My puzzle input called for a game with 72,104 marbles. My initial approach was
|
|
75
|
+to use a sequence (similar to a list) to store the values of the marbles as
|
|
76
|
+they're added to the circle. This got an answer for part 1 in about 10
|
|
77
|
+seconds and put me at number 124 on the global leaderboard for fastest
|
|
78
|
+completion. Unfortunately, when part 2 was revealed it asked me to calculate
|
|
79
|
+the result if there were 7,210,400 marbles in play.
|
|
80
|
+
|
|
81
|
+Obviously a puzzle 100x larger would take at least 100x longer to run, and
|
|
82
|
+almost certainly a lot more than that. There isn't a way to calculate the
|
|
83
|
+later stages more quickly, so the only thing to be done is to make it
|
|
84
|
+run a lot faster. Seven million iterations isn't really *that* much of a
|
|
85
|
+burden for a modern CPU: for the code to be running this slowly the
|
|
86
|
+execution time of some of the operations must be scaling with the number
|
|
87
|
+of marbles. A quick look through the documentation reveals:
|
|
88
|
+
|
|
89
|
+{{< highlight text >}}
|
|
90
|
+proc del[T](x: var seq[T]; i: Natural) {...}
|
|
91
|
+deletes the item at index i by putting x[high(x)] into position i.
|
|
92
|
+This is an O(1) operation.
|
|
93
|
+
|
|
94
|
+proc delete[T](x: var seq[T]; i: Natural) {...}
|
|
95
|
+deletes the item at index i by moving x[i+1..] by one position.
|
|
96
|
+This is an O(n) operation.
|
|
97
|
+{{< / highlight >}}
|
|
98
|
+
|
|
99
|
+Because we have to delete a marble at an arbitrary point and maintain the
|
|
100
|
+ordering of the others, I was using the `delete()` proc which has an O(n)
|
|
101
|
+runtime. The other potentially costly operation is inserting a new marble;
|
|
102
|
+the documentation doesn't mention the runtime but all of the Nim docs have
|
|
103
|
+a direct link to the source code, and we can [see](https://github.com/nim-lang/Nim/blob/72e15ff739cc73fbf6e3090756d3f9cb3d5af2fa/lib/system.nim#L1561)
|
|
104
|
+that inserting an element requires iterating over all the elements after it,
|
|
105
|
+so it's also O(n) in the worst case.
|
|
106
|
+
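To make the cost concrete, here is a minimal sketch of what a seq-based simulation could look like. It is not the code from my repository: the proc name `playNaive` and the choice to index scores by `marble mod players` are assumptions made for illustration. Every turn calls either `insert` or `delete`, each of which shifts the tail of the seq.

```nim
# Hypothetical seq-based simulation; `playNaive` is an illustrative name.
proc playNaive(players, lastMarble: int): int =
  ## Returns the winning score using a plain seq as the circle.
  var
    circle = @[0]
    current = 0                      # index of the current marble in `circle`
    scores = newSeq[int](players)
  for marble in 1 .. lastMarble:
    if marble mod 23 == 0:
      # Step seven places counter-clockwise, wrapping around the seq.
      current = (current - 7 + circle.len) mod circle.len
      scores[marble mod players] += marble + circle[current]
      circle.delete(current)         # O(n): shifts every later element down
      if current == circle.len:
        current = 0
    else:
      # The new marble goes between the marbles one and two places clockwise.
      current = (current + 1) mod circle.len + 1
      circle.insert(marble, current) # O(n): shifts every later element up
  result = max(scores)

echo playNaive(9, 25)
```

With n marbles and an O(n) shift on nearly every turn, the total work grows roughly quadratically, which is why a 100x larger input hurts so much more than 100x.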
|
|
107
|
+### DoublyLinkedLists: ~500ms
|
|
108
|
+
|
|
109
|
+When you need performant inserts and deletes in a list, the go-to solution
|
|
110
|
+is a linked list. Because nodes store references to their neighbours
|
|
111
|
+(instead of being stored consecutively in an array or list), delete and
|
|
112
|
+insert operations are O(1): you simply need to change a few pointers. Nim's
|
|
113
|
+[lists package](https://nim-lang.org/docs/lists.html) provides a
|
|
114
|
+convenient `DoublyLinkedList` that I went ahead and used.
|
|
115
|
+
|
|
116
|
+Instead of using the old `insert` and `delete` methods I now had my own
|
|
117
|
+which simply manipulate the nodes' previous and next pointers:
|
|
118
|
+
|
|
119
|
+{{< highlight nim >}}
|
|
120
|
+func insertAfter(node: DoublyLinkedNode[int], value: int) =
|
|
121
|
+ var newNode = newDoublyLinkedNode(value)
|
|
122
|
+ newNode.next = node.next
|
|
123
|
+ newNode.prev = node
|
|
124
|
+ newNode.next.prev = newNode
|
|
125
|
+ newNode.prev.next = newNode
|
|
126
|
+
|
|
127
|
+func remove(node: DoublyLinkedNode[int]) =
|
|
128
|
+ node.prev.next = node.next
|
|
129
|
+ node.next.prev = node.prev
|
|
130
|
+{{< / highlight >}}
|
|
131
|
+
|
|
132
|
+This implementation brought the runtime down to a respectable 500ms,
|
|
133
|
+which handily beat Shane's PHP implementation. It was still an order of
|
|
134
|
+magnitude longer than any of my other solutions, though, so I wasn't
|
|
135
|
+happy yet.
|
|
136
|
+
|
|
137
|
+### Reduced imports: ~470ms
|
|
138
|
+
|
|
139
|
+One thing I was conscious of from trying to make Python performant was how
|
|
140
|
+the number of imports can pile on to startup time. I had a couple of unused
|
|
141
|
+imports that were easy to shed, and I also decided to implement my own
|
|
142
|
+linked list in favour of Nim's `lists` module. All this involved was
|
|
143
|
+defining a type and then replacing my usages of `DoublyLinkedNode[int]`
|
|
144
|
+with my new `Marble`.
|
|
145
|
+
|
|
146
|
+{{< highlight nim >}}
|
|
147
|
+type
|
|
148
|
+ Marble = ref object
|
|
149
|
+ next, prev: Marble
|
|
150
|
+ value: int
|
|
151
|
+{{< / highlight >}}
|
|
152
|
+
|
|
153
|
+These few changes didn't have a huge impact, but I was clutching at
|
|
154
|
+straws and every 30ms was a small victory.
|
|
155
|
+
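For reference, a whole simulation built on a hand-rolled `Marble` type might look roughly like the sketch below. The `play` proc and the way scores are indexed are assumptions for illustration rather than a copy of my actual solution; here `insertAfter` also returns the new node, which is a small convenience I've added for the example.

```nim
type
  Marble = ref object
    next, prev: Marble
    value: int

proc insertAfter(node: Marble, value: int): Marble =
  ## Links a new marble in directly after `node` and returns it.
  result = Marble(value: value, next: node.next, prev: node)
  node.next.prev = result
  node.next = result

proc remove(node: Marble) =
  ## Unlinks `node` from the circle by joining its neighbours together.
  node.prev.next = node.next
  node.next.prev = node.prev

# Hypothetical driver: `play` is an illustrative name.
proc play(players, lastMarble: int): int =
  var scores = newSeq[int](players)
  var current = Marble(value: 0)
  current.next = current             # a circle of one marble
  current.prev = current
  for marble in 1 .. lastMarble:
    if marble mod 23 == 0:
      for _ in 1 .. 7:               # walk seven places counter-clockwise
        current = current.prev
      scores[marble mod players] += marble + current.value
      let removed = current
      current = current.next
      removed.remove()
    else:
      current = current.next.insertAfter(marble)
  max(scores)

echo play(9, 25)
```

Every operation inside the loop is now a fixed number of pointer updates, so the runtime grows linearly with the number of marbles.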
|
|
156
|
+### Inlining methods and small optimisations: ~420ms
|
|
157
|
+
|
|
158
|
+Thinking the code was about as fast as I was going to get it, I made
|
|
159
|
+a final pass to see if there were any little tweaks I could make.
|
|
160
|
+First off, I added the `inline` pragma to my insert and remove methods,
|
|
161
|
+to hint to the C compiler that they should be inlined. I was concerned
|
|
162
|
+that the overhead of calling a function seven million times would add up,
|
|
163
|
+and inlining the fairly simple operation seems reasonable. It's entirely
|
|
164
|
+possible the C compiler was already doing this (they're pretty clever),
|
|
165
|
+but making the hint explicit in Nim is really easy so there's nothing to lose:
|
|
166
|
+
|
|
167
|
+{{< highlight nim >}}
|
|
168
|
+func insertAfter(node: Marble, value: int) {.inline.} =
|
|
169
|
+ var newNode = new(Marble)
|
|
170
|
+ newNode.value = value
|
|
171
|
+ newNode.next = node.next
|
|
172
|
+ newNode.prev = node
|
|
173
|
+ newNode.next.prev = newNode
|
|
174
|
+ newNode.prev.next = newNode
|
|
175
|
+
|
|
176
|
+func remove(node: Marble) {.inline.} =
|
|
177
|
+ node.prev.next = node.next
|
|
178
|
+ node.next.prev = node.prev
|
|
179
|
+{{< / highlight >}}
|
|
180
|
+
|
|
181
|
+I also made some small algorithmic tweaks. These are usually the bread
|
|
182
|
+and butter of optimisations but for this problem there were only a couple
|
|
183
|
+I could see:
|
|
184
|
+
|
|
185
|
+- We only care about the current player every 23rd marble, so instead
|
|
186
|
+ of tracking the player each turn we can just calculate a 23 player
|
|
187
|
+ jump when needed
|
|
188
|
+- Instead of testing whether the current marble is divisible by 23,
|
|
189
|
+ which is potentially non-trivial for large numbers, we can use a
|
|
190
|
+ separate variable that just counts down from 23 and gets reset
|
|
191
|
+- Instead of calculating the boundary condition (`100 * marbles`) whenever
|
|
192
|
+ it's used, we can put this in a variable and calculate it once up-front.
|
|
193
|
+ (The C compiler probably handled this for us anyway)
|
|
194
|
+
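The first two tweaks can be sketched in isolation. The proc below is a hypothetical helper (the name `scoringTurns` and the zero-based player index are assumptions) that lists which player scores on which marble, using a countdown instead of a per-turn divisibility test and a single 23-player jump instead of a per-turn player counter:

```nim
# Hypothetical helper showing only the bookkeeping, not the marble circle.
proc scoringTurns(players, lastMarble: int): seq[(int, int)] =
  ## Lists (marble, player index) for every 23rd marble.
  var
    player = 0       # zero-based index, advanced only on scoring turns
    countdown = 23   # turns remaining until the next special marble
  for marble in 1 .. lastMarble:
    dec countdown
    if countdown == 0:
      countdown = 23
      # Jump 23 players at once instead of stepping one player per turn.
      player = (player + 23) mod players
      result.add((marble, player))

echo scoringTurns(9, 50)
```

There is still one modulo operation per scoring turn, but that is one in every 23 turns rather than one per marble.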
|
|
195
|
+This combination of tweaks saved another 50ms, and it seemed like there
|
|
196
|
+wasn't a whole lot left that could possibly change.
|
|
197
|
+
|
|
198
|
+### Non-reference counted objects: ~180ms
|
|
199
|
+
|
|
200
|
+While I was pondering further improvements, Shane mentioned that he managed
|
|
201
|
+to make PHP's garbage collector segfault with his solution. That got me
|
|
202
|
+thinking: what would happen if Nim didn't have to worry about garbage
|
|
203
|
+collecting our marbles? We have a fixed amount of them and don't need to
|
|
204
|
+worry about memory leaks as the program runs for half a second and then
|
|
205
|
+quits. Changing the Marble type and manually allocating memory for it
|
|
206
|
+— something that is virtually impossible in languages like PHP or Python —
|
|
207
|
+was trivial in Nim:
|
|
208
|
+
|
|
209
|
+{{< highlight nim >}}
|
|
210
|
+type
|
|
211
|
+ Marble = object
|
|
212
|
+ next, prev: ptr Marble
|
|
213
|
+ value: int32
|
|
214
|
+
|
|
215
|
+proc insertAfter(node: ptr Marble, value: int) {.inline.} =
|
|
216
|
+ var newNode = cast[ptr Marble](alloc0(sizeof(Marble)))
|
|
217
|
+{{< / highlight >}}
|
|
218
|
+
|
|
219
|
+Taking the garbage collector out of the equation more than doubled the performance!
|
|
220
|
+Still, it was my only solution that took more than 100ms and that bothered me...
|
|
221
|
+
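The snippet above only shows the allocation line, so here is a fuller sketch of how untraced `ptr Marble` nodes could be linked and unlinked. The `newMarble` helper is an assumption introduced for the example, and the `dealloc` call is optional for a run-once program; it is included only to show where freeing would go.

```nim
type
  Marble = object
    next, prev: ptr Marble
    value: int32

# Hypothetical helper, not part of the original solution.
proc newMarble(value: int32): ptr Marble =
  ## Allocates a zeroed Marble that the garbage collector does not track.
  result = cast[ptr Marble](alloc0(sizeof(Marble)))
  result.value = value

proc insertAfter(node: ptr Marble, value: int32): ptr Marble {.inline.} =
  ## Links a freshly allocated marble in directly after `node`.
  result = newMarble(value)
  result.next = node.next
  result.prev = node
  node.next.prev = result
  node.next = result

proc remove(node: ptr Marble) {.inline.} =
  ## Unlinks `node` and hands its memory back to the allocator.
  node.prev.next = node.next
  node.next.prev = node.prev
  dealloc(node)

var first = newMarble(0)
first.next = first                   # a circle of one marble
first.prev = first
let second = first.insertAfter(1)
echo second.value
```

Because these are plain pointers, nothing stops the program from using a marble after it has been freed, which is the price paid for skipping the garbage collector.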
|
|
222
|
+### No looking back: ~120ms
|
|
223
|
+
|
|
224
|
+Thinking about memory allocations made me take a hard look at the structure
|
|
225
|
+of the `Marble` type. Each of the seven million marbles has a previous pointer
|
|
226
|
+that we only use to backtrack by a fixed amount every 23rd play, which seems
|
|
227
|
+wasteful. If we reduce the amount of memory we have to allocate, we'll logically
|
|
228
|
+reduce the time taken allocating it.
|
|
229
|
+
|
|
230
|
+As the game is simulated we keep track of the "current" marble, so why not
|
|
231
|
+keep track of the marble eight behind that? That would allow us to turn the
|
|
232
|
+doubly-linked list into a singly-linked list and save a whole bunch of memory.
|
|
233
|
+This ends up being slightly complicated as initially there aren't eight marbles,
|
|
234
|
+and every 23rd play we jump the current position backwards (and without
|
|
235
|
+previous pointers, we can't jump the "current minus eight" pointer backwards).
|
|
236
|
+
|
|
237
|
+To work around these issues, I added a "trailing" pointer that gradually drifts
|
|
238
|
+backwards to eight behind the current pointer as moves are played. There are
|
|
239
|
+22 normal moves that each advance the current pointer by two, so there's plenty
|
|
240
|
+of time for this to happen.
|
|
241
|
+
|
|
242
|
+{{< highlight nim >}}
|
|
243
|
+var
|
|
244
|
+ currentTrail = current
|
|
245
|
+ currentTrailDrift = 0
|
|
246
|
+
|
|
247
|
+# When a standard move occurs:
|
|
248
|
+current.next.insertAfter(i)
|
|
249
|
+current = current.next.next
|
|
250
|
+if currentTrailDrift == 8:
|
|
251
|
+ # Keep the trail eight marbles behind the current one
|
|
252
|
+ currentTrail = currentTrail.next.next
|
|
253
|
+else:
|
|
254
|
+ # Don't move the trail so it drifts away by two marbles
|
|
255
|
+ currentTrailDrift += 2
|
|
256
|
+{{< / highlight >}}
|
|
257
|
+
|
|
258
|
+This is one of those optimisations that makes the code a bit harder to follow,
|
|
259
|
+but it sliced a third of the runtime off and takes us tantalisingly close to
|
|
260
|
+that 100ms threshold.
|
|
261
|
+
|
|
262
|
+### One bulk order of memory, please: ~50ms
|
|
263
|
+
|
|
264
|
+Thinking about memory allocations, I realised we were doing seven million small
|
|
265
|
+allocations over the lifetime of the program. We know upfront how many marbles
|
|
266
|
+there are going to be and will need to allocate memory for them all at some
|
|
267
|
+point, so why not just do it in one big bang?
|
|
268
|
+
|
|
269
|
+Fortunately, again, Nim lets you dive from the high-level Python-like world
|
|
270
|
+down to the nitty-gritty of memory management and pointers without blinking.
|
|
271
|
+Now after reading the puzzle input, I allocate a big chunk of memory (for my
|
|
272
|
+input with seven million marbles this equates to around 86MB of RAM) and keep
|
|
273
|
+a pointer to it:
|
|
274
|
+
|
|
275
|
+{{< highlight nim >}}
|
|
276
|
+let
|
|
277
|
+ hundredMarbles = marbles * 100
|
|
278
|
+ memory = alloc(MarbleSize * hundredMarbles)
|
|
279
|
+{{< / highlight >}}
|
|
280
|
+
|
|
281
|
+Then when it comes to creating a "new" Marble, we simply calculate the
|
|
282
|
+position in our memory block and use it as a pointer:
|
|
283
|
+
|
|
284
|
+{{< highlight nim >}}
|
|
285
|
+proc addressOf(memory: pointer, marbleNumber: int): Marble {.inline.} =
|
|
286
|
+ cast[Marble](cast[uint](memory) + cast[uint](marbleNumber * MarbleSize))
|
|
287
|
+
|
|
288
|
+proc insertAfter(node: Marble, memory: pointer, value: int): Marble {.inline.} =
|
|
289
|
+ var newNode = memory.addressOf(value)
|
|
290
|
+{{< / highlight >}}
|
|
291
|
+
|
|
292
|
+Changing to this one-time allocation more than halved the runtime of the
|
|
293
|
+program, placing it firmly under the 100ms target I was aiming at. It's
|
|
294
|
+particularly pleasing how little effort was required for optimisations like
|
|
295
|
+this, and how you can switch from high-level Python-style code to low-level
|
|
296
|
+C-style pointer manipulation.
|
|
297
|
+
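As an alternative to raw address arithmetic, the same idea can be sketched with `ptr UncheckedArray`, which lets Nim do the offset calculation. This is a swapped-in technique rather than what my solution does, and the names `Arena`, `newArena` and `marbleAt` are assumptions for illustration:

```nim
type
  Marble = object
    next: ptr Marble
    value: int32
  Arena = ptr UncheckedArray[Marble]   # hypothetical name for the big block

proc newArena(count: int): Arena =
  ## One allocation big enough for every marble the game will create.
  cast[Arena](alloc(count * sizeof(Marble)))

proc marbleAt(arena: Arena, number: int): ptr Marble {.inline.} =
  ## Marble `number` always lives in slot `number`, so "allocating" it is
  ## just taking the address of that slot and initialising its fields.
  result = addr arena[number]
  result.value = int32(number)
  result.next = nil

let arena = newArena(4)
let zero = arena.marbleAt(0)
zero.next = zero                       # a circle of one marble
let one = arena.marbleAt(1)
one.next = zero.next                   # link marble 1 in after marble 0
zero.next = one
echo one.value
```

Since every marble's slot is determined by its number, no per-marble bookkeeping is needed and the allocator is only called once.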
|
|
298
|
+----
|
|
299
|
+
|
|
300
|
+You can find the full code to my solution in my [aoc-2018](https://github.com/csmith/aoc-2018)
|
|
301
|
+repository. If you're not taking part in [Advent of Code](https://adventofcode.com/)
|
|
302
|
+I highly recommend it, and if you've not used [Nim](https://nim-lang.org/)
|
|
303
|
+it's definitely worth a look.
|
|
304
|
+
|
|
305
|
+[^1]: PHP has always been fast to start, due to its primary use in a CGI
|
|
306
|
+ environment, and the last few major versions of PHP have made it
|
|
307
|
+ unbelievably blazingly fast as well, while Python unfortunately
|
|
308
|
+ [has issues with startup time](https://mail.python.org/pipermail/python-dev/2018-May/153296.html)
|