In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
"What are you doing?", asked Minsky.
"I am training a randomly wired neural net to play Tic-Tac-Toe," Sussman replied.
"Why is the net wired randomly?", asked Minsky.
"I do not want it to have any preconceptions of how to play", Sussman said.
Minsky then shut his eyes.
"Why do you close your eyes?", Sussman asked his teacher.
"So that the room will be empty."
At that moment, Sussman was enlightened.
According to that link, it's based on a true story:
So Sussman began working on a program. Not long after,
this odd-looking bald guy came over. Sussman figured the
guy was going to boot him out, but instead the man sat
down, asking, "Hey, what are you doing?" Sussman talked
over his program with the man, Marvin Minsky. At one point
in the discussion, Sussman told Minsky that he was using a
certain randomizing technique in his program because he
didn't want the machine to have any preconceived notions.
Minsky said, "Well, it has them, it's just that you don't
know what they are." It was the most profound thing Gerry
Sussman had ever heard. And Minsky continued, telling him
that the world is built a certain way, and the most
important thing we can do with the world is avoid
randomness, and figure out ways by which things can be
planned. Wisdom like this has its effect on
seventeen-year-old freshmen, and from then on Sussman was
hooked.

Convolutional networks, deep learning, and machine learning in general are arguably the most disruptive technologies of our lifetimes. BTW, pardon the shameless plug, but I just started a new blog this morning to host ML resources and my own experiments: http://blog.cognition.tech/?view=classic
Literally any reasonable learning algorithm, even nearest neighbor, will learn that fairly quickly. The article had to use millions of samples, which is extremely excessive; a more efficient algorithm should need only thousands.
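To make that sample-efficiency claim concrete, here is a small sketch (the `life_rule` function is my own stand-in for the article's ground-truth update) of a 1-nearest-neighbor learner, which over 9-bit inputs degenerates into exact memorization of the 512 possible neighborhoods:

```python
import random
from itertools import product

def life_rule(neigh):
    # Ground-truth Game of Life update: 9 binary cells, center at index 4
    s = sum(neigh) - neigh[4]
    return 1 if s == 3 or (neigh[4] == 1 and s == 2) else 0

random.seed(0)

# 1-NN over exact 9-bit inputs is just a memo table: there are only
# 2^9 = 512 distinct neighborhoods to ever see.
memory = {}
for _ in range(5000):  # thousands of samples, not millions
    x = tuple(random.randint(0, 1) for _ in range(9))
    memory[x] = life_rule(x)

# Evaluate on every possible neighborhood, predicting 0 for unseen ones.
correct = sum(memory.get(x, 0) == life_rule(x)
              for x in product((0, 1), repeat=9))
print(f"seen {len(memory)}/512 neighborhoods, accuracy {correct}/512")
```

A few thousand random samples already cover essentially all 512 cases (coupon-collector: roughly 512 ln 512 ≈ 3200 draws).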
It would be interesting to see whether the neural network learns the underlying rule: that placement does not matter, only the sum, and that the crucial value is 3. That would be cool to see.
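That property can at least be verified against the ground-truth rule itself (a sketch; whether a trained net internalizes it is the separate, open question):

```python
from itertools import product

def life_rule(neigh):
    # Standard Game of Life update: 9 binary cells, center at index 4
    s = sum(neigh) - neigh[4]
    return 1 if s == 3 or (neigh[4] == 1 and s == 2) else 0

# Group all 512 neighborhoods by (center state, neighbor sum). If every
# group maps to a single output, placement truly does not matter.
groups = {}
for x in product((0, 1), repeat=9):
    key = (x[4], sum(x) - x[4])
    groups.setdefault(key, set()).add(life_rule(x))

assert all(len(outputs) == 1 for outputs in groups.values())

# And the crucial value is 3: the only sum that brings a dead cell to life.
births = sorted(s for (c, s), v in groups.items() if c == 0 and v == {1})
print(births)
```

One could probe a trained net the same way: feed it pairs of neighborhoods with equal sums but different placements and check whether its outputs agree.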
This is a very good point, and it's the reason why I don't find much value in GAs. There is an inherent conflict between "an algorithm for a specific problem" (which can be optimized to run in a short time; the above problem reduces to a lookup in a 256 x 2 one-bit matrix and thus runs in O(1)) and "an algorithm for any problem (in a class)", which, while theoretically capable of finding a solution, is prohibitive in time and/or space.
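To make the O(1) claim concrete, a sketch (with `life_rule` as my stand-in for the ground-truth update) that precomputes the whole rule as 512 one-bit entries and answers any query with a single lookup:

```python
from itertools import product

def life_rule(neigh):
    # Ground-truth Game of Life update: 9 binary cells, center at index 4
    s = sum(neigh) - neigh[4]
    return 1 if s == 3 or (neigh[4] == 1 and s == 2) else 0

# Precompute the entire rule: 2^9 = 512 one-bit entries (equivalently a
# 256 x 2 one-bit matrix: 2^8 neighbor patterns x 2 center states).
# product() enumerates tuples in binary-counting order, so the list index
# equals the neighborhood read as a 9-bit number.
table = [life_rule(x) for x in product((0, 1), repeat=9)]

def step_cell(neigh):
    # O(1): pack the 9 bits into an index, then one table lookup
    idx = 0
    for b in neigh:
        idx = (idx << 1) | b
    return table[idx]

# Dead center with exactly three live neighbors -> birth
print(step_cell((0, 1, 0, 1, 0, 1, 0, 0, 0)))
```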
As far as I can tell, GAs are barely any better than a random walk and thus not useful for any non-trivial problems. I am not yet sure if NNs are in the same category, though I suspect they are.
I love it when there's a nugget of rigor behind things like this that can be pulled out and applied elsewhere!
- The conv layer has only 20 ReLU units; I'm not sure that suffices to memorize the rules. The net might be "underfitting" the rule set and therefore making mistakes on rare pixel arrangements.
- More worryingly, it's also strange that the author uses a fully connected layer right after the conv layer (this is implicitly added in ConvNetJS when you specify a loss layer). This means that the output neurons are a function of the entire preceding CONV layer's activations everywhere, while the Game of Life rules are local. The way to do this would be to instead use a 1x1 CONV layer with 2 neurons on top of the first 3x3 CONV layer, and interpret it as computing the class scores at every spatial position. However, in this case you'd want to apply the loss at every spatial position, and this "fully-convolutional" loss use case is not supported out of the box in ConvNetJS, but could be written.
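For concreteness, a minimal numpy sketch of the shapes in that fully-convolutional setup (weights are random and untrained; this only shows that a 1x1 CONV over the 20 local activations yields two class scores at every spatial position, on which a per-position loss could then be applied):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
grid = rng.integers(0, 2, size=(H, W)).astype(float)

# 3x3 CONV, 20 filters, ReLU (zero-padding of 1 keeps the spatial size)
W1 = rng.normal(0, 0.1, size=(20, 3, 3))
pad = np.pad(grid, 1)
act = np.zeros((20, H, W))
for f in range(20):
    for i in range(H):
        for j in range(W):
            act[f, i, j] = np.sum(W1[f] * pad[i:i+3, j:j+3])
act = np.maximum(act, 0)  # ReLU

# 1x1 CONV with 2 filters: two class scores (dead/alive) at every position,
# each depending only on the 20 activations at that position -- matching
# the locality of the Game of Life rule.
W2 = rng.normal(0, 0.1, size=(2, 20))
scores = np.einsum('cf,fij->cij', W2, act)
print(scores.shape)  # (2, 8, 8): per-pixel class scores for a per-pixel loss
```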
- However, with a fully-convolutional loss, the zero-padding of 1 used around the borders might cause trouble. Normally this is okay with images, but here the neurons all share parameters spatially (and hence compute the same function) and don't "know" whether they are at the border or in the middle of the image. I'm not sure how this Game of Life handles boundary conditions, but if the borders obey different dynamics, you'd want to distinguish the border pixels with a special "border" feature vector at the input. E.g. each pixel is a 3-vector, with a 1-hot encoding for (border, positive pixel, negative pixel).
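A sketch of that input encoding (the function name and channel order are my own choices):

```python
import numpy as np

def encode_with_border(grid):
    """Encode each pixel as a 1-hot 3-vector over (border, live, dead)."""
    H, W = grid.shape
    border = np.zeros((H, W), dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    out = np.zeros((3, H, W))
    out[0] = border                 # special "border" channel
    out[1] = (grid == 1) & ~border  # live interior pixel
    out[2] = (grid == 0) & ~border  # dead interior pixel
    return out

g = np.zeros((4, 4))
g[1, 1] = 1
enc = encode_with_border(g)
print(enc[:, 1, 1])  # interior live pixel -> [0. 1. 0.]
```

Since the conv filters see this border channel in their receptive field, the spatially-shared weights can still learn distinct dynamics near the boundary.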
- And it's also strange that the author uses a "regression" loss for what is a binary classification problem.
So, a nice attempt, but several funny choices, and it's clear why it didn't fully work :)
We could use a softmax classifier, since that is meant for multi-class classification. Each class in this case would represent a neighboring pixel, and predicting that class would mean activating that pixel. However, softmax assumes a single label, with probabilities that add up to 1. Our problem is a multi-label binary classification: several pixels can be active at once.
We could train nine such softmax groupings, one for each neighboring pixel, but that immediately seems like a bad idea.
Another awful solution: make each possible pixel configuration (2^9 of them) a class and do a standard softmax.
I can start to see why the author chose to use regression loss as a sort of hack to get this to work, but I'm trying to think out the best, proper solution.
Any thoughts?
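To make the softmax mismatch concrete, a small numpy sketch (the scores are made up; the independent sigmoids just illustrate the standard multi-label setup, with one per-pixel binary cross-entropy each, not a claim that it is the best solution):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores for 9 output pixels; pixels 0, 3 and 7 should be "on".
scores = np.array([4.0, -2.0, -2.0, 4.0, -2.0, -2.0, -2.0, 4.0, -2.0])

# Softmax: probabilities compete and sum to 1, so it can only express a
# single "on" label; here the three candidates split the mass and none
# reaches 0.5.
p_soft = softmax(scores)
n_on_soft = int((p_soft > 0.5).sum())

# Independent sigmoids: each pixel gets its own binary decision, so any
# subset of pixels can be on simultaneously.
p_sig = sigmoid(scores)
n_on_sig = int((p_sig > 0.5).sum())

print(n_on_soft, n_on_sig)  # softmax flags no pixel; sigmoids flag all three
```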
Do you think a similar approach could be used to learn to generalize Navier-Stokes by looking at fluid flows?