Recursion and concurrency with Go

Eager to jump on the bandwagon, I've been reading up on Go, the new language released by a group at Google. I had nothing to do with the development of the language, but several things about it interest me, in particular, its approach to concurrency.

Some time ago I came across a rather neat language (or rather, language extension) called Cilk. Cilk extends ANSI C with concurrency primitives, implementing a concurrency model based on 'spawning' functions, which return a deferred result, and later 'syncing' to obtain those results. Most interestingly, Cilk uses a 'work stealing' scheduler, which means that spawning a function is nearly as cheap as simply calling it.

One of the demonstrations of Cilk that really captured my imagination was using it to search the game tree of a perfect-information game such as chess. Parallelizing tree search is generally a really hard problem: most of the function invocations do only a small amount of work, and the overhead of context switching and thread creation often overwhelms any benefit gained from parallelization. Cilk's work-stealing strategy made it possible to parallelize these algorithms in an intuitive fashion and still see performance improvements from multiple cores.

Go's concurrency model isn't the same as Cilk's, but it's similar. As well as being interesting in its own right, it brought Cilk back to mind, and led me to wonder whether Go's model might allow the same sort of treatment. I set out to find out, by writing a simple Connect 4 AI.

First, we need some basic data structures to represent our game state and moves:

package main
 
import "fmt"  // used by main, below, for printing the board and scores
 
type state struct {
  board [7][6]int;
  heights [7]int;
}
 
type move struct {
  column int;
  score int;
}

Every Go source file has to be in a package, and a program's startup function must be called 'main', in the package 'main' - hence the name of our package. We also import the 'fmt' package, which main will need later for printing. Next, we define our 'state' and 'move' structures. Anyone familiar with C-like languages will recognize the general syntax here, but note that in Go, the data type is listed after the field name, while array dimensions are listed before the data type.

The 'heights' array in the state struct is an optimisation. For each column, it stores the index in the board array of the first unoccupied cell, allowing us to record a move without having to first loop through the column to find that cell. We can thus define the 'move' method simply:

func (s *state) move(column, player int) {
  s.board[column][s.heights[column]] = player;
  s.heights[column]--;
}

move is an example of a Go method. The struct the method operates on - its receiver - is declared in parentheses after the 'func' keyword; 's' is the name we will use to refer to the struct the method is being called on. Methods are called in the familiar manner - as s.move(col, player).

Before we try and go all parallel, let's write a simple synchronous implementation of the negamax algorithm, the basic algorithm for exploring game trees:

func negamax_sync(s state, column, player, depth int) move {
  s.move(column, player);
  
  score := s.evaluate() * -player;
  if score < -1000000 || score > 1000000 || depth == 0 {
    return move{column, score};
  }
  
  min_score := 1000000;
  for i := 0; i < 7; i++ {
    if s.heights[i] >= 0 {
      m := negamax_sync(s, i, -1 * player, depth - 1);
      if m.score < min_score {
        min_score = m.score;
      }
    }
  }
  
  return move{column, -min_score};
}

Again, Go looks very similar to C and other related languages, and again the types come after the variable names in the function's arguments. One convenience is that a sequence of parameters that all share the same type (column, player and depth in the above example) only needs the type specified once, after the last name. Also note the := operator, which declares a new variable and initializes it with the value of the right-hand side of the expression, inferring its type.
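
For example, these two declarations are equivalent (you'd use one or the other, of course):

var min_score int = 1000000;  // explicit declaration, with the type spelled out
min_score := 1000000;         // shorthand: type inferred from the value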

Negamax works by exploring the game tree to a given depth, alternating between each player's turn at each level. Each level in a game tree is called a 'ply'. When it reaches the maximum depth, it calls an evaluation function to get an estimate of how good the game state is for the current player, and returns that estimate. We haven't included the real evaluation function here; it's fairly straightforward, scoring boards based on how many connect 4s (instant win or lose) there are, and how many opportunities there are with 3, 2 or 1 tokens. If you're interested, you can read the complete source; a rough sketch of the idea follows.
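
Here's a minimal sketch of the kind of evaluation function meant - not the one from the complete source. It scans every horizontal, vertical and diagonal window of four cells and scores it by how close each player is to completing it. The weights are illustrative guesses, chosen so that a completed connect 4 crosses the +/-1000000 threshold negamax checks for; it also assumes positive scores favour player 1, and the real code's sign conventions may differ:

func (s *state) evaluate() int {
  // Illustrative weights: a window holding n of one player's tokens (and
  // none of the opponent's) contributes weights[n]; a full connect 4
  // scores well past the +/-1000000 win threshold used by negamax.
  weights := [5]int{0, 1, 10, 100, 10000000};
  // Directions to scan: right, up, up-right, up-left.
  dirs := [4][2]int{{1, 0}, {0, 1}, {1, 1}, {-1, 1}};
  score := 0;
  for x := 0; x < 7; x++ {
    for y := 0; y < 6; y++ {
      for d := 0; d < 4; d++ {
        // Skip windows of four that don't fit on the board.
        ex := x + 3 * dirs[d][0];
        ey := y + 3 * dirs[d][1];
        if ex < 0 || ex >= 7 || ey >= 6 {
          continue;
        }
        ours, theirs := 0, 0;
        for i := 0; i < 4; i++ {
          switch s.board[x + i * dirs[d][0]][y + i * dirs[d][1]] {
          case 1:
            ours++;
          case -1:
            theirs++;
          }
        }
        // A window containing tokens from both players can never become
        // a connect 4 for either of them, so it scores nothing.
        if theirs == 0 {
          score += weights[ours];
        }
        if ours == 0 {
          score -= weights[theirs];
        }
      }
    }
  }
  return score;
}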

At internal nodes, negamax recursively explores the subtrees, evaluating how good each one is for its opponent, and selects the branch where the opponent's best outcome is the worst of the available choices, returning that branch's score (negated, because it's from the opponent's point of view) as its own. This works because in a zero-sum game such as chess or Connect 4, what's bad for our opponent is good for us. This general search strategy is called 'minimax', of which negamax is simply a refinement.

To use our negamax function to play a game of Connect 4, we need another function that evaluates each possible move using negamax_sync, and picks the best one:

func sync_move(s *state, player int) int {
  max := move{-1, -100000000};
  for i := 0; i < 7; i++ {
    if s.heights[i] >= 0 {
      if m := negamax_sync(*s, i, player, 6); m.score > max.score {
        max = m;
      }
    }
  }
  return max.column;
}

This function should be easy to understand: it simply steps through each possible move we can make, asks negamax_sync to estimate how each one scores, and returns the best available option. We've hardcoded a search depth of 6 ply here; on current hardware this is a good tradeoff between running too fast to measure accurately and running so slowly we don't want to wait.

Finally, we need a main function to repeatedly call sync_move and display the result:

func main() {
  var s state;
  s.heights = [7]int{ 5, 5, 5, 5, 5, 5, 5 };
 
  player := 1;
  score := 0;
  for score > -1000000 && score < 1000000 {
    move := sync_move(&s, player);
    s.move(move, player);
    score = s.evaluate();
    player = -1 * player;
 
    fmt.Printf("%s", s);
    fmt.Printf("%d\n\n", score);
  }
}

First, we allocate a new state struct on the stack, and initialize the 'heights' array. Then we alternate player turns, calling sync_move and making the resulting move, until the score indicates one player or the other has won the game. We've left out the code for printing the board; it's also in the complete source.
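
For the curious, here's a minimal sketch of the kind of String method that would let fmt's %s verb print the board as shown in the output below. It's not the real version, and which symbol belongs to which player is a guess:

func (s state) String() string {
  // Implementing String lets fmt's %s verb format the state directly.
  // Row 5 is the bottom of the board, so printing y from 0 to 5 draws
  // the board the right way up.
  out := "";
  for y := 0; y < 6; y++ {
    for x := 0; x < 7; x++ {
      switch s.board[x][y] {
      case 1:
        out += "o";
      case -1:
        out += "x";
      default:
        out += ".";
      }
    }
    out += "\n";
  }
  return out;
}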

Compiling this and running the output with 'time' yields the following:

$ 6g go4.go
$ 6l go4.6
$ /usr/bin/time ./6.out
.......
.......
.......
.......
.......
o......
3

.......
.......
.......
.......
.......
o.....x
0

... Lots more ...

ox.xx.o
xo.oo.x
xx.xx.o
oo.ooox
xxoxoxo
ooxoxox
10019902

47.75user 0.00system 0:47.78elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+1636minor)pagefaults 0swaps

To parallelize this, we need to use Go's concurrency support. Go implements concurrency using two main primitives: goroutines, which are effectively lightweight, non-preemptively scheduled threads, and channels, which provide ordered, strongly-typed delivery of messages between goroutines. To start a new goroutine, simply call a function with 'go' in front of it. Sending and receiving messages uses the '<-' operator, and by default sends and receives block until both a sender and a receiver on the same channel are ready.
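
As a tiny standalone illustration (not part of the Connect 4 program), spawning a goroutine and collecting its result over a channel looks like this:

func square(x int, out chan int) {
  out <- x * x;  // send the result; blocks until someone receives it
}

func example() {
  out := make(chan int);
  go square(7, out);           // runs concurrently with example()
  result := <-out;             // blocks until square sends
  fmt.Printf("%d\n", result);  // prints 49
}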

Here's negamax_async, a modification of negamax_sync using goroutines and channels:

func negamax_async(s state, column, player, depth int, ret chan move) {
  s.move(column, player);
  
  score := s.evaluate() * -player;
  if score < -1000000 || score > 1000000 || depth == 0 {
    ret <- move{column, score};
    return;
  }
  
  results := make(chan move);
  num_results := 0;
  for i := 0; i < 7; i++ {
    if s.heights[i] >= 0 {
      go negamax_async(s, i, -1 * player, depth - 1, results);
      num_results++;
    }
  }
 
  min_score := 1000000;
  for i := 0; i < num_results; i++ {
    m := <-results;
    if m.score < min_score {
      min_score = m.score;
    }
  }
 
  ret <- move{column, -min_score};
}

There are several significant changes from the synchronous version. The return value has been replaced by a channel passed in as an argument, we create our own channel for sub-goroutines to return results to, and execution now proceeds in distinct 'scatter' and 'gather' phases. First, we spawn a goroutine for each branch; then, we retrieve results from the goroutines via the channel (waiting for them to complete if necessary) and calculate our return value - which we send to our own return channel.

Here's the corresponding 'async_move' function:

func async_move(s *state, player int) int {
  ret := make(chan move);
  num_moves := 0;
  for i := 0; i < 7; i++ {
    if s.heights[i] >= 0 {
      go negamax_async(*s, i, player, 6, ret);
      num_moves += 1;
    }
  }
  
  max := move{-1, -100000000};
  for i := 0; i < num_moves; i++ {
    if m := <-ret; m.score > max.score || (m.score == max.score && m.column < max.column) {
      max = m;
    }
  }
  return max.column;
}

Again, the changes here follow the same general pattern. You may have noticed there's some code duplication between our '_move' and 'negamax_' functions; it could probably be eliminated at the cost of a little indirection - one possible approach is sketched below.
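
For instance, a helper like the following could own the scatter and gather loops, with each caller supplying a function that launches the appropriate goroutine for a legal column. This is a hypothetical sketch, not code from the actual program:

func best_move(s *state, launch func(col int, ret chan move)) int {
  // Scatter: launch one goroutine per legal column.
  ret := make(chan move);
  num_moves := 0;
  for i := 0; i < 7; i++ {
    if s.heights[i] >= 0 {
      launch(i, ret);
      num_moves++;
    }
  }
  // Gather: collect the results and keep the best.
  max := move{-1, -100000000};
  for i := 0; i < num_moves; i++ {
    if m := <-ret; m.score > max.score || (m.score == max.score && m.column < max.column) {
      max = m;
    }
  }
  return max.column;
}

async_move would then reduce to a single call - best_move(s, func(col int, ret chan move) { go negamax_async(*s, col, player, 6, ret) }) - and the variants below would differ only in the function they pass.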

Changing main() to invoke async_move instead of sync_move results in the following output:

$ /usr/bin/time ./6.out 
...

ox.xx.o
xo.oo.x
xx.xx.o
oo.ooox
xxoxoxo
ooxoxox
10019902

56.18user 4.19system 1:41.67elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k
1264inputs+0outputs (9major+1682186minor)pagefaults 0swaps

Okay, that's actually worse than the synchronous version. But wait! As it stands, Go requires an environment variable, GOMAXPROCS, to specify the number of threads to use. Let's try it again with GOMAXPROCS=4, the number of CPUs on this machine:

$ GOMAXPROCS=4 /usr/bin/time ./6.out
...

ox.xx.o
xo.oo.x
xx.xx.o
oo.ooox
xxoxoxo
ooxoxox
10019902

102.36user 117.21system 1:08.50elapsed 320%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1574943minor)pagefaults 0swaps

Okay, so that's better than it was single-threaded, but still worse than the entirely synchronous version. Note too that the system CPU time is up substantially, while there's only a small increase in user CPU. Doubtless this is due to the increased thread switching overhead exceeding any gains from parallelization.

There may still be some improvements to be rescued from our strategy, however. What if we parallelize each of the top-level negamax evaluations, but not any of their children? That still lets us split the work up as many as 7 ways, with a minimum of communication overhead or thread-switching issues, since we spawn only a handful of goroutines, each with a substantial chunk of work to do. Here's a revised move function that takes that approach:

func async2_move(s *state, player int) int {
  ret := make(chan move);
  num_moves := 0;
  
  for i := 0; i < 7; i++ {
    if s.heights[i] >= 0 {
      go func(col int) { ret <- negamax_sync(*s, col, player, 6) }(i);
      num_moves += 1;
    }
  }
    
  max := move{-1, -100000000};
  for i := 0; i < num_moves; i++ {
    if m := <-ret; m.score > max.score || (m.score == max.score && m.column < max.column) {
      max = m;
    }
  }
  return max.column;
}

The only difference between async_move and async2_move is that instead of invoking negamax_async as a goroutine, we're invoking negamax_sync. We're also taking advantage of another neat feature of Go - anonymous functions (which are also full closures). Note that we pass the loop variable i in as the parameter col rather than capturing it in the closure: by the time a goroutine actually runs, the loop may already have moved i on to another value. Running our third revision with GOMAXPROCS=4 yields:

$ GOMAXPROCS=4 /usr/bin/time ./6.out
...

ox.xx.o
xo.oo.x
xx.xx.o
oo.ooox
xxoxoxo
ooxoxox
10019902

48.31user 0.07system 0:48.24elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+1887minor)pagefaults 0swaps

Now that is puzzling. I somewhat expected the poor performance of our Cilk-like concurrency, but I certainly didn't expect the performance of the slightly-concurrent version, on a 4-processor machine, to be no better than the original synchronous version. To confuse things further, a few facts:

  • Trivial code that steps through loops millions of times exhibits the expected speedup when spawned using goroutines, so it's not that Go doesn't do parallelism properly yet.
  • However many threads we specify with GOMAXPROCS, each consumes 1/nth of one CPU, totalling no more than one CPU at any one time - despite there being no shared state or synchronization between goroutines!
  • While performance is slightly worse on this machine whether 2 or 4 processors are used, testing this code on my Mac Pro shows it shaves a few seconds off the runtime when using 2 processors - and still without exceeding a total of 1.0 CPU utilization.

Admittedly this last result has me stumped. If anyone has some insight, please let us know in the comments.

In general, Go seems like a rather nice new language. I don't like everything about it, but I do like a lot of it, and the concurrency features I demonstrated above are only part of the appeal. I look forward to exploring Go further; it seems like a good alternative to my traditional choice, C, the next time I want a systems programming language.
