Sometimes knowledge hides away in difficult places, but now and then the time is ripe to venture out in search of it, no matter how hard the journey. Welcome to an expedition, an ascent, into the rarefied world of machine learning.
Don’t set off without packing the following basics.
A computer, it is often said, only knows as much as the programmer who gave it its instructions: all it does is follow them. This is true at the simplest level of the machinery: software executes the programmer’s commands, line by line. But does that mean a computer can’t learn? To say so would be just as false as to say that a pupil can never be smarter than their teacher. Just as a good teacher doesn’t simply have pupils learn facts by rote, but nurtures their own development, a computer can be programmed so that it continuously improves at its tasks the more time it spends on them. Welcome to the world of machine learning.
The first self-teaching program to make a splash was developed by the IBM researcher Arthur Samuel in 1956. The software played draughts at a respectable amateur level. At the start, the computer only knew the rules of the game and a few rules of thumb that Samuel had given it. But with every game, the machine learned more. After eight to ten hours of training time, it was better than its creator. Today, humans can no longer beat computers at draughts. In chess, the computer is at least an equal match for us, and since Google’s AlphaGo program beat the European Go champion, humans are no longer undefeated in any board game.
Machine learning is a subdomain of Artificial Intelligence, and today a wide variety of software techniques fall under the label: computers learn to identify people in photos. They steer driverless cars through city traffic after a few thousand hours of training. They find patterns in big data.
In many of these learning techniques, a human is still the teacher: the human sets a goal and grades the computer’s performance, while the computer varies and adjusts its behaviour in order to earn better marks. At the same time, so-called unsupervised learning plays an important role: the computer has to make sense of masses of data on its own. Google, for example, feeds millions of photos into a computer network, and the program forms categories like “cat” or “human” automatically. This closely resembles the way a young child learns, forming categories before it can name them.
Let’s go! On the gentler slopes you will encounter knowledge which can bring you out in a sweat.
As a first climbing exercise, let’s play a game which is already too simple for five-year-olds: Noughts and Crosses. The board is made up of squares arranged three by three. Two players take it in turns to set down their counters. Whoever first gets three of their counters in a row, column or diagonal wins. There are 255,168 possible games. In 131,184 of them, the player who goes first wins; the second player wins in 77,904 variants; 46,080 end in a draw. More important is the fact that a “smart” player will never lose a game: regardless of whether they go first or second, they can set down their pieces (or draw their noughts or crosses if playing with pen and paper) so that the game ends at least in a draw.
How can you figure out the best move to make in a given situation? In Noughts and Crosses, all possible moves can be calculated in advance. That leads to a decision tree: a player looks at all the moves they can make from the current state of play, then at all possible replies from their opponent, and so on. In chess, this leads to an explosion in the number of possible configurations; but in Noughts and Crosses, the combinations remain manageable: after at most nine moves the board is full, and the game ends in one of 138 end positions. Every branch ends with the victory of one of the players, or a draw.
In order to assign a value to every playing position, one first evaluates every leaf on this tree: a win gets a value of +1, a loss gets -1 and a draw is given 0. Then work backwards through the game: every inner node of the decision tree corresponds to a playing position and receives a value, namely the highest of its successors’ values if it is your turn there, and the lowest if it is the opponent’s turn. At the end, every position carries an evaluation of 1, 0 or -1. Branches with a value of 1 mark a strategy that is guaranteed to win.
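This backed-up evaluation is the classic minimax procedure. A minimal sketch in Python, assuming a board encoded as a tuple of nine cells; the encoding and function names are illustrative choices, not from the article:

```python
from functools import lru_cache

# All eight winning lines on the 3x3 board (rows, columns, diagonals).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of the position from X's point of view: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    if None not in board:
        return 0  # board full, no winner: a draw
    children = [minimax(board[:i] + (player,) + board[i + 1:],
                        'O' if player == 'X' else 'X')
                for i, cell in enumerate(board) if cell is None]
    # X picks the best value for X, O the worst.
    return max(children) if player == 'X' else min(children)

# Perfect play from the empty board ends in a draw:
print(minimax((None,) * 9, 'X'))  # 0
```

Choosing, at each turn, a move whose resulting position has the best backed-up value reproduces the never-losing strategy described above.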
An example: let’s assume that our opponent plays first and places their cross in the middle of the board (the best opening move). We then place our nought either in a corner square or in the middle square of one of the grid’s edges. Which of these moves is better? Let’s look at the variants in which we choose the middle square of the left-hand edge. The opposing player then has four essentially different replies to choose from. Let’s assume that they place their cross directly above our nought. In our next move we then have no choice: we must place a nought in the lower right-hand square, in order to stop a diagonal line from being created. Then, the opposing player can knock us out of the game with a cross in the middle of the upper row.
In fact, our first move was the fatal one: it leads to a -1 in the decision tree, and should be avoided. If we had placed our nought in a corner instead, we could force at least a draw even against the smartest opponent. That move has a value of 0.
How could we get a computer program to play with this strategy? First possibility: all the values in the decision tree are put in a table. The computer looks at every move in its table and chooses the move with the highest value. It plays perfectly from the first move and has no need to “think” at any point. Second possibility: The computer starts the game totally “stupid”. In every situation it marks the possible moves with the values -1, 0 and 1. As soon as a game is over, it changes these values retrospectively, in light of the outcome. In this way, its evaluation of the game-play will constantly improve.
If we now let the computer play against itself, something interesting happens: While both parties (which are in fact just one party) have no idea about the game, their retentive memory helps them to try out the different possible moves and learn which approach is good for one position and which is bad. And from a completely ignorant program, we get one that never loses a game.
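The second possibility, learning by self-play, can be sketched in a few lines. This toy version stores a value for every (position, move) pair it has tried and adjusts those values after each finished game; the simple counting update and the small exploration rate are illustrative simplifications, not from the article:

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

# Learned values for (position, move) pairs; unseen pairs count as 0.
values = {}

def self_play_game(explore=0.1):
    board, player, history = [''] * 9, 'X', []
    while winner(board) is None and '' in board:
        moves = [i for i, cell in enumerate(board) if cell == '']
        if random.random() < explore:
            move = random.choice(moves)  # occasionally try something new
        else:  # otherwise pick one of the best-valued moves
            best = max(values.get((tuple(board), m), 0) for m in moves)
            move = random.choice([m for m in moves
                                  if values.get((tuple(board), m), 0) == best])
        history.append((tuple(board), move, player))
        board[move] = player
        player = 'O' if player == 'X' else 'X'
    return winner(board), history

def learn(games):
    for _ in range(games):
        result, history = self_play_game()
        for position, move, player in history:
            # Retrospective update: reward the winner's moves, punish the loser's.
            outcome = 0 if result is None else (1 if player == result else -1)
            key = (position, move)
            values[key] = values.get(key, 0) + outcome

learn(5000)  # the learned values now bias move choice toward better lines
```

Both sides of the self-play game share the same value table, which is exactly the "one party playing both roles" situation described above.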
Take deep breaths! It’s not what you expected, but you’ll make it.
Board games are comprehensible worlds with clear rules and unambiguous situations. While people can quickly surrender in the face of their complexity, for computers they are straightforward. On the other hand, thinking through muddy reality, which is easy for us humans, is extremely difficult for computers. Take, for example, an exercise which most people would hardly even label thinking: classification. Is that a photo of a cat or dog? Is that the voice of mother, or a stranger? Is that thing in the road a plastic bag or a rock? We are able to arrive at the right answers without any real thought and with an astounding degree of accuracy. But even we don’t know exactly how we manage it.
In the 1970s and 80s, people tried to teach computers to classify things using rules developed by experts: a cat is an animal with pointed ears and whiskers; a mouse is grey and has a long tail. This method didn’t work well at all. In recent years, we have had much more success with so-called neural nets, which imitate the structure of the human brain. They perform astoundingly well with large volumes of data.
Neural nets were actually invented in the Fifties, but they only came into their own with modern computing power, under the label “deep learning”. William Jones and Josiah Hoskins described a very simple example in Byte magazine in 1987. Their neural net was meant to help Little Red Riding Hood survive the deep, dark wood, and in particular to keep her from being eaten by the wolf. The story also features Grandma, and a huntsman who saves Little Red Riding Hood.
Big ears, big eyes, big teeth
The program knows nothing about people. It only sees particular physical characteristics and has to derive a particular behaviour from them. The wolf has big ears, big eyes and big teeth. When Little Red Riding Hood meets him, she should run away, scream, and look for the huntsman. Grandma has big eyes and wrinkles, and is friendly. If Little Red Riding Hood spies her, she should approach, kiss her on the cheek, and offer her the food she has brought. The huntsman has big ears and is friendly and attractive. The desired behaviour: Little Red Riding Hood should approach him, offer him food and flirt with him (the article is, as mentioned, almost 30 years old).
We can see right away that the relationship between sensory impressions and desired behaviour is far from straightforward: A being with big ears could be the wolf, but also could be the huntsman, and these each require a very different reaction.
The neural net is made up of two “layers” of cells: it has six input cells, which register the relevant characteristics of the actors (big ears, big eyes, etc.), and seven output cells, which correspond to Little Red Riding Hood’s repertoire of behaviours (running away, screaming, looking for the huntsman, etc.).
Every input cell is linked to every output cell, and at the start, each of these connections has a given “weight”—a number that describes its strength. We start with relatively small, randomly-chosen weights. This initiates the self-training of the network. It is fed successively with the input values for wolf, Grandma and huntsman (the first figure stands for “big ears”, the last for “attractive”):
Wolf: (1, 1, 1, 0, 0, 0)
Grandmother: (0, 1, 0, 1, 1, 0)
Huntsman: (1, 0, 0, 1, 0, 1)
Each input value is passed from its input cell to all the output cells (from “run away” to “flirt”) and multiplied by the weight of the respective connection. Each of the seven output neurons thus receives six numbers, which are added together. If the sum exceeds a threshold (e.g. 2.5), the neuron “fires” and the output cell takes the value 1.
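This forward pass is easy to write down. A sketch in Python with randomly initialized weights, as at the start of training; the weight range and the seed are illustrative choices:

```python
import random

random.seed(0)  # illustrative: makes the random start reproducible
THRESHOLD = 2.5

# 7 output cells, each connected to all 6 input cells with a random weight.
weights = [[random.uniform(0, 1) for _ in range(6)] for _ in range(7)]

def forward(inputs):
    """One forward pass: each output cell sums its weighted inputs and fires
    (value 1) when the sum exceeds the threshold, otherwise stays at 0."""
    outputs = []
    for cell_weights in weights:
        total = sum(w * x for w, x in zip(cell_weights, inputs))
        outputs.append(1 if total > THRESHOLD else 0)
    return outputs

wolf = (1, 1, 1, 0, 0, 0)
print(forward(wolf))  # before training, the pattern depends only on the random weights
```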
At the start, the net behaves randomly, because the weights of the connections are chosen at random. So that it can learn, we must compare the result with the desired action from Little Red Riding Hood:
Reaction to the wolf: (1, 1, 1, 0, 0, 0, 0)
Reaction to the Grandmother: (0, 0, 0, 1, 1, 1, 0)
Reaction to the huntsman: (0, 0, 0, 0, 1, 1, 1)
and alter the strengths of the connections on that basis. After about 15 run-throughs, the net becomes largely stable. It develops the connections shown below left.
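This adjustment can be written as the classic perceptron learning rule: whenever an output cell is wrong, shift each of its incoming weights a little in the direction that would have produced the right answer. A sketch, assuming the threshold cells described above; the learning rate, seed and epoch limit are illustrative choices, not the original Byte implementation:

```python
import random

random.seed(1)  # illustrative, for a reproducible run
THRESHOLD, LEARNING_RATE = 2.5, 0.5

# Inputs (big ears ... attractive) and desired outputs (run away ... flirt).
training = {
    (1, 1, 1, 0, 0, 0): (1, 1, 1, 0, 0, 0, 0),  # wolf
    (0, 1, 0, 1, 1, 0): (0, 0, 0, 1, 1, 1, 0),  # Grandmother
    (1, 0, 0, 1, 0, 1): (0, 0, 0, 0, 1, 1, 1),  # huntsman
}

# Small random starting weights: 7 output cells x 6 input cells.
weights = [[random.uniform(0, 0.5) for _ in range(6)] for _ in range(7)]

def forward(inputs):
    return [1 if sum(w * x for w, x in zip(row, inputs)) > THRESHOLD else 0
            for row in weights]

for epoch in range(200):  # generous upper bound; it stabilizes much earlier
    errors = 0
    for inputs, target in training.items():
        output = forward(inputs)
        for j in range(7):
            delta = target[j] - output[j]  # -1, 0 or +1
            if delta:
                errors += 1
                for i in range(6):
                    weights[j][i] += LEARNING_RATE * delta * inputs[i]
    if errors == 0:  # every example already answered correctly
        break

print(all(forward(x) == list(t) for x, t in training.items()))  # True
```

Each of the three characters activates at least one input that the others do not, so the examples are linearly separable and this simple two-layer rule is enough to learn them.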
But why go through this complicated training procedure when we already know all the rules? Because in practice, nets are used in situations where the desired output is known only for a limited number of training examples. If a net is to analyze photos of animals (as digital volumes of pixels) and learn from them how to name the animals, we never tell it that a cat has pointed ears. The net cannot explain afterwards why it labelled a given image “cat”, even when it is right. But it can apply what it has learned to new pictures and recognize cats there too.
A strong drive to flirt
We have trained the Little Red Riding Hood net on three examples. There are a total of 64 possible inputs for the network, from (0, 0, 0, 0, 0, 0) to (1, 1, 1, 1, 1, 1). And each of these inputs will create an output in the net. Is this plausible?
For example, we can imagine what would happen if the wolf put on sunglasses and started being really friendly. That would correspond to the input values (1, 0, 1, 1, 0, 0). The output of the net which has been trained here would be: a certain tendency towards the correct reaction to the wolf (running away, screaming, looking for the huntsman), but also a strong drive to flirt. Clearly the wolf presenting himself like this confuses the girl, which is also understandable. Ambivalent input creates ambivalent behaviour.
Onto the summit
Now it’s getting drafty: You must master this theory if you want to rise to the occasion.
In order to further increase the performance of neural nets, developers came up with a trick: they insert a “hidden” layer of neurons between the input and output cells. When the net is trained correctly, these neurons develop certain specializations. In our example, three cells can be inserted in the hope that they will specialize in recognizing the wolf, the Grandmother and the huntsman (W, G and H in the graphic on the right). In the experiment, they do exactly that, without any explicit instruction: cell W reacts especially to inputs which correspond to characteristics of the wolf, and triggers the appropriate responses. The procedure that makes such hidden layers trainable, so-called backpropagation, was popularized in 1986 and marked a breakthrough.
Hidden layers can be seen as ever-higher levels of abstraction of the sensory input: a net which has to recognize images sees only raw pixels at the input level. The first hidden layer of neurons will, perhaps, respond to high-contrast edges. That is the basis for identifying, for example, circles or squares at the next layer. Deeper in the net, neurons develop which can recognize eyes or even a whole cat’s head.
Sometimes the net also gives results which its creators rightly find embarrassing. For example, an automated image recognition program used by the photo service Flickr categorized men with black skin as “apes”. The gate of the Dachau concentration camp was labelled a “climbing frame”. The neural net has no prior knowledge and extremely limited tact. Software engineers need to train their algorithms in greater sensitivity.
Deep learning is now yielding successes which eluded artificial intelligence for decades: the nets can reliably recognize human faces in photos. They can understand spoken language very well. Skype can interpret between speakers of different languages in real time.
For the learning programs described here, there was always a human teacher who trained the program on the correct answers. But increasingly, these nets are learning independently: they are fed huge volumes of data and left to make sense of it themselves. Google engineers caused a stir two years ago when they put a neural net “on drugs”. If you force the net to find an object in an ordinary image, much as a person looks for patterns in clouds, it will hallucinate and see, for example, fantastical fish in the sky where there are none. The machines have learned to dream.
The article first appeared in ZEIT Wissen No. 5/2016, 16 August 2016. Reprinted with kind permission from the Zeitverlag.
Christoph Drösser also wrote this article, using a computer, for his new book “Total berechenbar: Wie Algorithmen für uns entscheiden” (Totally Calculable: How Algorithms Decide for Us). Despite all the progress in deep learning, he remains sceptical that a computer will ever be able to write such things by itself.