Post about OTT nimming

master
Chris Smith 5 years ago
parent
commit
23a7ee4b89
2 changed files with 308 additions and 0 deletions
  1. 308
    0
      site/content/post/2018-12-09-over-the-top-optimisations-in-nim.md
  2. Binary
      site/static/res/images/aoc/advent-of-code.png

+ 308
- 0
site/content/post/2018-12-09-over-the-top-optimisations-in-nim.md

@@ -0,0 +1,308 @@
1
+---
2
+date: 2018-12-09
3
+title: Over-the-top optimisations with Nim
4
+url: /2018/12/09/over-the-top-optimisations-in-nim/
5
+image: /res/images/aoc/advent-of-code.png
6
+description: Sometimes it's fun to just abandon good practice and make something *fast*.
7
+---
8
+
9
+For the past few years I've been taking part in
10
+[Eric Wastl](https://twitter.com/ericwastl)'s
11
+[Advent of Code](https://adventofcode.com/), a coding challenge that provides
12
+a 2-part problem each day from the 1st of December through to Christmas Day.
13
+The puzzles are always interesting — especially as they get progressively
14
+harder — and there's an awesome community of folks that share their solutions
15
+in a huge variety of languages.
16
+
17
+To up the ante somewhat, [Shane](https://blog.dataforce.org.uk/) and I usually
18
+have a little informal competition to see who can write the most performant
19
+code. This year, though, Shane went massively overboard and wrote an entire
20
+[benchmarking suite and webapp](https://blog.dataforce.org.uk/2018/08/advent-of-code-benchmarking/)
21
+to measure our performance, which I took as an invitation and personal
22
+challenge to try to beat him every single day.
23
+
24
+For the past three years I'd used Python exclusively, as its vast standard
25
+library and awesome syntax lead to quick and elegant solutions. Unfortunately
26
+it stands no chance, at least on the earlier puzzles, of beating the speed
27
+of Shane's preferred language of PHP. For a while I consoled myself with the
28
+notion that once the challenges get more complicated I'd be in with a shot,
29
+but after the third or fourth time that Shane's solution finished before
30
+the Python interpreter even started[^1] I decided I'd have to jump ship. I
31
+started using Nim.
32
+
33
+<!--more-->
34
+
35
+### Introducing Nim
36
+
37
+[Nim](https://nim-lang.org/), formerly Nimrod, is a compiled language that
38
+takes a lot of cues from Python. It has a very nice and familiar syntax,
39
+a reasonable standard library, and it's *fast*. I'd thought about learning
40
+it before but didn't really have anything suitable to use it on, until now.
41
+
42
+The code I used for my day one part one answer looks like this in Nim:
43
+
44
+{{< highlight nim >}}
45
+import math, sequtils, strutils
46
+
47
+echo readFile("data/01.txt").strip.splitLines.map(parseInt).sum
48
+{{< / highlight >}}
49
+
50
+It's a one-liner that Python would be proud of. The difference with Nim,
51
+though, is that this compiles down to C, and from there you get all
52
+the benefits of an optimising C compiler and linker. You end up with
53
+a blazingly fast stand-alone binary.
54
+
55
+### Losing my marbles
56
+
57
+[Day 9](https://adventofcode.com/2018/day/9) of this year's Advent of
58
+Code proved interesting to optimise, and I'm going to walk through some
59
+of the steps I took and their impact. I'm in no way a Nim expert and
60
+this is for a program that will be run once and then thrown away, so
61
+please don't take this too much to heart.
62
+
63
+Day 9 presents a marble game played by Santa's elves, whereby marbles
64
+with increasing values are added to a circle according to certain
65
+rules; every 23rd marble is special and the elf playing it gets to
66
+keep that one and also pick up a marble a certain number of places
67
+away. The winner is the elf with the highest total marble value at the end.
68
+It doesn't sound like a particularly thrilling game, but as far as
69
+I can tell there's no way to easily predict the winner without
70
+simulating it step-by-step so it makes for an interesting problem.
71
+
72
+### Naive solution: over 10 minutes
73
+
74
+My puzzle input called for a game with 72,104 marbles. My initial approach was
75
+to use a sequence (similar to a list) to store the values of the marbles as
76
+they're added to the circle. This got an answer for part 1 in about 10
77
+seconds and put me at number 124 on the global leaderboard for fastest
78
+completion. Unfortunately, when part 2 was revealed it asked me to calculate
79
+the result if there were 7,210,400 marbles in play.
80
+
81
+Obviously a puzzle 100x larger would take at least 100x longer to run, and
82
+almost certainly a lot more than that. There isn't a way to calculate the
83
+later stages more quickly, so the only thing to be done is to make it
84
+run a lot faster. Seven million iterations isn't really *that* much of a
85
+burden for a modern CPU: for the code to be running this slowly the
86
+execution time of some of the operations must be scaling with the number
87
+of marbles. A quick look through the documentation reveals:
88
+
89
+{{< highlight text >}}
90
+proc del[T](x: var seq[T]; i: Natural) {...}
91
+deletes the item at index i by putting x[high(x)] into position i.
92
+This is an O(1) operation.
93
+
94
+proc delete[T](x: var seq[T]; i: Natural) {...}
95
+deletes the item at index i by moving x[i+1..] by one position.
96
+This is an O(n) operation.
97
+{{< / highlight >}}
98
+
99
+Because we have to delete a marble at an arbitrary point and maintain the
100
+ordering of the others, I was using the `delete()` proc which has an O(n)
101
+runtime. The other potentially costly operation is inserting a new marble;
102
+the documentation doesn't mention the runtime but all of the Nim docs have
103
+a direct link to the source code, and we can [see](https://github.com/nim-lang/Nim/blob/72e15ff739cc73fbf6e3090756d3f9cb3d5af2fa/lib/system.nim#L1561)
104
+that inserting an element requires iterating over all the elements after it,
105
+so it's also O(n) in the worst case.
106
+
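A minimal sketch of that seq-based approach, to show where the O(n) calls sit. This is not the code from the repository: the proc name `playNaive` is made up for illustration, and the "seven places counter-clockwise" rule comes from the puzzle statement rather than from this post.

```nim
proc playNaive(players, lastMarble: int): int =
    ## Returns the highest score, keeping the circle in a plain seq.
    var
        circle = @[0]
        scores = newSeq[int](players)
        current = 0
    for marble in 1 .. lastMarble:
        if marble mod 23 == 0:
            # Step seven places counter-clockwise, wrapping around.
            current = (current - 7 + circle.len) mod circle.len
            scores[marble mod players] += marble + circle[current]
            # O(n): every later element is shifted down by one.
            circle.delete(current)
            if current == circle.len:
                current = 0
        else:
            # The new marble goes between the marbles one and two
            # places clockwise of the current one.
            current = (current + 1) mod circle.len + 1
            # O(n): every later element is shifted up by one.
            circle.insert(marble, current)
    max(scores)

echo playNaive(9, 25)  # prints 32, the first example in the puzzle text
```

With a few million marbles in the seq, each of those shifts moves megabytes of data, which is where the quadratic blow-up comes from.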
107
+### DoublyLinkedLists: ~500ms
108
+
109
+When you need performant inserts and deletes in a list, the go-to solution
110
+is a linked list. Because nodes store references to their neighbours
111
+(instead of being stored consecutively in an array or list), delete and
112
+insert operations are O(1): you simply need to change a few pointers. Nim's
113
+[lists package](https://nim-lang.org/docs/lists.html) provides a
114
+convenient `DoublyLinkedList` that I went ahead and used.
115
+
116
+Instead of using the old `insert` and `delete` methods I now had my own
117
+which simply manipulate the nodes' previous and next pointers:
118
+
119
+{{< highlight nim >}}
120
+func insertAfter(node: DoublyLinkedNode[int], value: int) =
121
+    var newNode = newDoublyLinkedNode(value)
122
+    newNode.next = node.next
123
+    newNode.prev = node
124
+    newNode.next.prev = newNode
125
+    newNode.prev.next = newNode
126
+
127
+func remove(node: DoublyLinkedNode[int]) =
128
+    node.prev.next = node.next
129
+    node.next.prev = node.prev
130
+{{< / highlight >}}
131
+
132
+This implementation brought the runtime down to a respectable 500ms,
133
+which handily beat Shane's PHP implementation. It was still an order of
134
+magnitude longer than any of my other solutions, though, so I wasn't
135
+happy yet.
136
+
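As a small usage sketch, the two helpers can be exercised on a circle that starts as a single node pointing at itself. The `circleFrom` helper is hypothetical and only there to print the result, and the helpers are written as `proc` rather than `func` here on the assumption that stricter compiler settings may reject a `func` that mutates through a reference.

```nim
import std/lists

proc insertAfter(node: DoublyLinkedNode[int], value: int) =
    let newNode = newDoublyLinkedNode(value)
    newNode.next = node.next
    newNode.prev = node
    newNode.next.prev = newNode
    newNode.prev.next = newNode

proc remove(node: DoublyLinkedNode[int]) =
    node.prev.next = node.next
    node.next.prev = node.prev

proc circleFrom(start: DoublyLinkedNode[int]): seq[int] =
    ## Walks clockwise once around the circle, collecting the values.
    result.add start.value
    var node = start.next
    while node != start:
        result.add node.value
        node = node.next

# A one-marble circle: the node is its own neighbour in both directions.
let zero = newDoublyLinkedNode(0)
zero.next = zero
zero.prev = zero

zero.insertAfter(1)     # circle is now 0, 1
zero.insertAfter(2)     # circle is now 0, 2, 1
let afterInserts = circleFrom(zero)
zero.next.remove()      # unlink the 2 again
let afterRemove = circleFrom(zero)
echo afterInserts, " ", afterRemove
```

Neither operation walks the list, so the cost per move no longer depends on how many marbles are in play.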
137
+### Reduced imports: ~470ms
138
+
139
+One thing I was conscious of from trying to make Python performant was how
140
+the number of imports can pile on to startup time. I had a couple of unused
141
+imports that were easy to shed, and I also decided to implement my own
142
+linked list instead of Nim's `lists` module. All this involved was
143
+defining a type and then replacing my usages of `DoublyLinkedNode[int]`
144
+with my new `Marble`.
145
+
146
+{{< highlight nim >}}
147
+type
148
+    Marble = ref object
149
+        next, prev: Marble
150
+        value: int
151
+{{< / highlight >}}
152
+
153
+These few changes didn't have a huge impact, but I was clutching at
154
+straws and every 30ms was a small victory.
155
+
156
+### Inlining methods and small optimisations: ~420ms
157
+
158
+Thinking the code was about as fast as I was going to get it, I made
159
+a final pass to see if there were any little tweaks I could make.
160
+First off, I added the `inline` pragma to my insert and remove methods,
161
+to hint to the C compiler that they should be inlined. I was concerned
162
+that the overhead of calling a function seven million times would add up,
163
+and inlining the fairly simple operation seems reasonable. It's entirely
164
+possible the C compiler was already doing this (they're pretty clever),
165
+but making the hint explicit in Nim is really easy so there's nothing to lose:
166
+
167
+{{< highlight nim >}}
168
+func insertAfter(node: Marble, value: int) {.inline.} =
169
+    var newNode = new(Marble)
170
+    newNode.value = value
171
+    newNode.next = node.next
172
+    newNode.prev = node
173
+    newNode.next.prev = newNode
174
+    newNode.prev.next = newNode
175
+
176
+func remove(node: Marble) {.inline.} =
177
+    node.prev.next = node.next
178
+    node.next.prev = node.prev
179
+{{< / highlight >}}
180
+
181
+I also made some small algorithmic tweaks. These are usually the bread
182
+and butter of optimisations but for this problem there were only a couple
183
+I could see:
184
+
185
+- We only care about the current player every 23rd marble, so instead
186
+  of tracking the player each turn we can just calculate a 23 player
187
+  jump when needed
188
+- Instead of testing whether the current marble is divisible by 23,
189
+  which is potentially non-trivial for large numbers, we can use a
190
+  separate variable that just counts down from 23 and gets reset
191
+- Instead of calculating the boundary condition (`100 * marbles`) whenever
192
+  it's used, we can put this in a variable and calculate it once up-front.
193
+  (The C compiler probably handled this for us anyway)
194
+
195
+This combination of tweaks saved another 50ms, and it seemed like there
196
+wasn't a whole lot left that could possibly change.
197
+
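Put together, a game loop using those tweaks might look roughly like the sketch below. It is an illustration rather than the repository code: `play`, `countdown` and `player` are assumed names, the helpers are plain `proc`s, and the "seven steps back" rule is taken from the puzzle statement.

```nim
type
    Marble = ref object
        next, prev: Marble
        value: int

proc insertAfter(node: Marble, value: int) {.inline.} =
    let newNode = Marble(value: value)
    newNode.next = node.next
    newNode.prev = node
    newNode.next.prev = newNode
    newNode.prev.next = newNode

proc remove(node: Marble) {.inline.} =
    node.prev.next = node.next
    node.next.prev = node.prev

proc play(players, lastMarble: int): int =
    var
        scores = newSeq[int](players)
        current = Marble(value: 0)
        countdown = 23   # replaces a `marble mod 23` test every turn
        player = 0       # only updated on scoring turns
    current.next = current
    current.prev = current
    for marble in 1 .. lastMarble:
        dec countdown
        if countdown == 0:
            countdown = 23
            # 23 turns have passed since the last scoring move.
            player = (player + 23) mod players
            for _ in 1 .. 7:
                current = current.prev
            scores[player] += marble + current.value
            let removed = current
            current = current.next
            removed.remove()
        else:
            let target = current.next
            target.insertAfter(marble)
            current = target.next
    max(scores)

echo play(9, 25)  # prints 32, matching the puzzle's first example
```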
198
+### Non-reference counted objects: ~180ms
199
+
200
+While I was pondering further improvements, Shane mentioned that he managed
201
+to make PHP's garbage collector segfault with his solution. That got me
202
+thinking: what would happen if Nim didn't have to worry about garbage
203
+collecting our marbles? We have a fixed amount of them and don't need to
204
+worry about memory leaks as the program runs for half a second and then
205
+quits. Changing the Marble type and manually allocating memory for it
206
+— something that is virtually impossible in languages like PHP or Python —
207
+was trivial in Nim:
208
+
209
+{{< highlight nim >}}
210
+type
211
+    Marble = object
212
+        next, prev: ptr Marble
213
+        value: int32
214
+
215
+proc insertAfter(node: ptr Marble, value: int) {.inline.} =
216
+    var newNode = cast[ptr Marble](alloc0(sizeof(Marble)))
217
+{{< / highlight >}}
218
+
219
+Taking the garbage collector out of the equation more than doubled the performance!
220
+Still, it was my only solution that took more than 100ms and that bothered me...
221
+
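The snippet above only shows the first line of the new `insertAfter`. A fuller sketch of the manually allocated version could look like this; `newMarble` is a hypothetical helper, and like the approach described above it never frees anything, relying on the process exiting shortly afterwards.

```nim
type
    Marble = object
        next, prev: ptr Marble
        value: int32

proc newMarble(value: int): ptr Marble {.inline.} =
    # alloc0 hands back zeroed memory that the garbage collector
    # does not track, so nothing is ever scanned or freed for us.
    result = cast[ptr Marble](alloc0(sizeof(Marble)))
    result.value = int32(value)

proc insertAfter(node: ptr Marble, value: int) {.inline.} =
    let newNode = newMarble(value)
    newNode.next = node.next
    newNode.prev = node
    newNode.next.prev = newNode
    newNode.prev.next = newNode

proc remove(node: ptr Marble) {.inline.} =
    # The unlinked marble is deliberately leaked.
    node.prev.next = node.next
    node.next.prev = node.prev

let first = newMarble(0)
first.next = first
first.prev = first
first.insertAfter(1)    # circle: 0, 1
first.insertAfter(2)    # circle: 0, 2, 1
echo first.next.value, " ", first.prev.value
```

In a longer-running program each removed marble would need a matching `dealloc`, which is exactly the bookkeeping this throwaway solution gets to skip.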
222
+### No looking back: ~120ms
223
+
224
+Thinking about memory allocations made me take a hard look at the structure
225
+of the `Marble` type. Each of the seven million marbles has a previous pointer
226
+that we only use to backtrack by a fixed amount every 23rd play, which seems
227
+wasteful. If we reduce the amount of memory we have to allocate, we'll logically
228
+reduce the time taken allocating it.
229
+
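To put a rough number on that, `sizeof` can compare the two layouts. The type names below are made up for the comparison, and the exact figures depend on the target: on a typical 64-bit platform I would expect padding to give 24 bytes with both pointers and 16 with only `next`, but treat those as assumptions rather than measurements.

```nim
type
    DoubleMarble = object
        next, prev: ptr DoubleMarble
        value: int32
    SingleMarble = object
        next: ptr SingleMarble
        value: int32

const marbleCount = 7_210_400

let
    doubleBytes = marbleCount * sizeof(DoubleMarble)
    singleBytes = marbleCount * sizeof(SingleMarble)

# Total bytes needed for every marble in part 2 under each layout.
echo "doubly-linked: ", doubleBytes
echo "singly-linked: ", singleBytes
```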
230
+As the game is simulated we keep track of the "current" marble, so why not
231
+keep track of the marble eight behind that? That would allow us to turn the
232
+doubly-linked list into a singly-linked list and save a whole bunch of memory.
233
+This ends up being slightly complicated as initially there aren't eight marbles,
234
+and every 23rd play we jump the current position backwards (and without
235
+previous pointers, we can't jump the "current minus eight" pointer backwards).
236
+
237
+To work around these issues, I added a "trailing" pointer that gradually drifts
238
+backwards to eight behind the current pointer as moves are played. There are
239
+22 normal moves that each advance the current pointer by two, so there's plenty
240
+of time for this to happen.
241
+
242
+{{< highlight nim >}}
243
+var
244
+    currentTrail = current
245
+    currentTrailDrift = 0
246
+
247
+# When a standard move occurs:
248
+current.next.insertAfter(i)
249
+current = current.next.next
250
+if currentTrailDrift == 8:
251
+    # Keep the trail eight marbles behind the current one
252
+    currentTrail = currentTrail.next.next
253
+else:
254
+    # Don't move the trail so it drifts away by two marbles
255
+    currentTrailDrift += 2
256
+{{< / highlight >}}
257
+
258
+This is one of those optimisations that makes the code a bit harder to follow,
259
+but it sliced a third of the runtime off and takes us tantalisingly close to
260
+that 100ms threshold.
261
+
262
+### One bulk order of memory, please: ~50ms
263
+
264
+Thinking about memory allocations, I realised we were doing seven million small
265
+allocations over the lifetime of the program. We know upfront how many marbles
266
+there are going to be and will need to allocate memory for them all at some
267
+point, so why not just do it in one big bang?
268
+
269
+Fortunately, again, Nim lets you dive from the high-level Python-like world
270
+down to the nitty-gritty of memory management and pointers without blinking.
271
+Now after reading the puzzle input, I allocate a big chunk of memory (for my
272
+input with seven million marbles this equates to around 86MB of RAM) and keep
273
+a pointer to it:
274
+
275
+{{< highlight nim >}}
276
+let
277
+    hundredMarbles = marbles * 100
278
+    memory = alloc(MarbleSize * hundredMarbles)
279
+{{< / highlight >}}
280
+
281
+Then when it comes to creating a "new" Marble, we simply calculate the
282
+position in our memory block and use it as a pointer:
283
+
284
+{{< highlight nim >}}
285
+proc addressOf(memory: pointer, marbleNumber: int): Marble {.inline.} =
286
+    cast[Marble](cast[uint](memory) + cast[uint](marbleNumber * MarbleSize))
287
+
288
+proc insertAfter(node: Marble, memory: pointer, value: int): Marble {.inline.} =
289
+    var newNode = memory.addressOf(value)
290
+{{< / highlight >}}
291
+
292
+Changing to this one-time allocation more than halved the runtime of the
293
+program, placing it firmly under the 100ms target I was aiming at. It's
294
+particularly pleasing how little effort was required for optimisations like
295
+this, and how you can switch from high-level Python-style code to low-level
296
+C-style pointer manipulation.
297
+
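A self-contained sketch of the same idea on a tiny block is below. The `MarbleObj` layout and the use of `sizeof` for `MarbleSize` are assumptions: 86MB over 7,210,400 marbles works out at about 12 bytes each, so the real `MarbleSize` is probably tighter than the padded size `sizeof` is likely to report on a 64-bit machine.

```nim
type
    MarbleObj = object
        next: ptr MarbleObj
        value: int32
    Marble = ptr MarbleObj

let MarbleSize = sizeof(MarbleObj)

proc addressOf(memory: pointer, marbleNumber: int): Marble {.inline.} =
    # Marble n lives n slots into the block, so "allocating" it is
    # nothing more than pointer arithmetic.
    cast[Marble](cast[uint](memory) + uint(marbleNumber * MarbleSize))

proc insertAfter(node: Marble, memory: pointer, value: int): Marble {.inline.} =
    result = memory.addressOf(value)
    result.value = int32(value)
    result.next = node.next
    node.next = result

let
    marbleCount = 4
    memory = alloc(MarbleSize * marbleCount)   # one allocation for everything

let first = memory.addressOf(0)
first.value = 0
first.next = first
discard first.insertAfter(memory, 1)   # circle: 0, 1
discard first.insertAfter(memory, 2)   # circle: 0, 2, 1

# Walk once around the circle before releasing the block.
var order = @[int(first.value)]
var node = first.next
while node != first:
    order.add int(node.value)
    node = node.next
echo order

dealloc(memory)
```

Because each marble's slot is derived from its number, removed marbles simply leave a hole in the block, which is fine for a program that exits as soon as it has printed its answer.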
298
+----
299
+
300
+You can find the full code to my solution in my [aoc-2018](https://github.com/csmith/aoc-2018)
301
+repository. If you're not taking part in [Advent of Code](https://adventofcode.com/)
302
+I highly recommend it, and if you've not used [Nim](https://nim-lang.org/)
303
+it's definitely worth a look.
304
+
305
+[^1]: PHP has always been fast to start, due to its primary use in a CGI
306
+      environment, and the last few major versions of PHP have made it
307
+      unbelievably fast at runtime as well, while Python unfortunately
308
+      [has issues with startup time](https://mail.python.org/pipermail/python-dev/2018-May/153296.html).

Binary
site/static/res/images/aoc/advent-of-code.png

