
Cognitive module

This workgroup will look at features necessary for a cognitive module in the game player.

Projects

Neural network implementation

Timmer Horiuchi

The Coggie project is an attempt to construct a neural network capable of playing the card game "Hearts". The network is designed to direct the oculomotor system to: 1) look at the first card in the trick; 2) look at the other cards in the robot's hand to find the lowest-value card of the matching suit; 3) if there is no matching-suit card, scan the hand again to find the largest-value card. If the robot needs to lead the trick, it looks for the lowest-value card in the hand that is not a heart; once a heart has been played, it can lead with a heart. See the ppt file 'coggie_tkh.ppt' below.
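For reference, the selection rule that this scanning sequence implements can be summarized in a few lines. The following Python sketch describes the rule only, not the spiking/oculomotor implementation, and the card encoding is hypothetical:

# Hypothetical sketch of Coggie's selection rule (not the spiking network).
# Cards are (value, suit) tuples, with value 2..14 (ace high).

def choose_card(hand, trick, hearts_broken):
    """hand: list of cards; trick: cards played so far in this trick."""
    if trick:                                    # following
        lead_suit = trick[0][1]
        matching = [c for c in hand if c[1] == lead_suit]
        if matching:
            return min(matching)                 # lowest matching-suit card
        return max(hand)                         # void: dump highest-value card
    # leading: lowest non-heart; once hearts are broken, simply the lowest card
    non_hearts = [c for c in hand if c[1] != 'hearts']
    if non_hearts and not hearts_broken:
        return min(non_hearts)
    return min(hand)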

Reinforcement learning

Frederic Broccard

The idea of this project was to develop an unsupervised learning agent endowed with decision-making abilities to play the game of Hearts with four players, the agent being one of the players. Assuming that the reader knows the rules of the game (see http://en.wikipedia.org/wiki/Hearts), the basic approach was to use reinforcement learning with Monte-Carlo methods and simulate many games to estimate the value of each card in the different situations encountered during a typical game of Hearts. These situations correspond to different states of the game, such as zero, one, two or three cards having been played in the trick. Besides states, reinforcement learning methods consider their values, the actions taken in each state and the reward obtained for taking those actions. In the context of Hearts, the actions correspond to the different cards played by the agent in different states.

I began by estimating the values of the different cards when three cards were already present in the trick, which is the easiest case. This nonetheless corresponds to 52*51*50 different possible states, which can be divided by a factor of two because the order of the second and third cards does not matter for the decision-making process. According to the rules of the game, the first card played in the trick sets the suit for the round, the other players must follow suit if they can, and the highest card of the led suit wins the trick. Thus, the most basic strategy is to play the lowest possible card of the leading suit, or to discard any heart when unable to follow suit. We played Hearts with simplified rules and did not take into account the queen of spades or the "shooting the moon" scoring variant.

Based on the outcome of each trick, and the action taken by the agent, a reward was given depending on whether or not the agent won the trick and how many hearts it took: a reward of -1 was given when it won a trick containing hearts, and zero otherwise. A final reward of +1 was given if the agent did not score the highest number of hearts at the end of the game. Simulating many games yielded an estimated value of each state, from which the agent made its choices of actions (playing a given card). I considered state-action pairs, as is often the case for games, and updated the values of these pairs based on the average reward gained over each game and/or each state-action pair encountered. Games were treated as different episodes and averages were taken over them; backing up the values estimated during a full episode is the core of the Monte-Carlo method. The basic pseudo-code for such an algorithm is given below (from Sutton & Barto, 1998, MIT Press):

states = cards in a trick
actions = card played in a trick
Returns(state, action) = list of returns observed for the pair

Initialize:
Values(state, action) ← 0
Decision policy for actions ← random
Returns(state, action) ← empty

Repeat forever:
i) generate an episode (game) following the current policy
ii) for each pair (state, action) appearing in the episode:

R ← return following the first occurrence of (state, action)
Append R to Returns(state, action)
Values(state, action) ← average(Returns(state, action))

iii) for each state in the episode (game):

policy(state) ← argmax over actions of Values(state, action)
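As a concrete illustration, here is a compact Python sketch of this first-visit Monte-Carlo update. The episode generator below is only a random stand-in for the actual Hearts simulator, and the state/action encodings are placeholders rather than those used in the project:

import random
from collections import defaultdict

values = defaultdict(float)    # estimated value of each (state, action) pair
returns = defaultdict(list)    # all returns observed for each pair
policy = {}                    # state -> card to play

def generate_episode():
    # Stand-in for one full game of simplified Hearts: returns a list of
    # (state, action, reward) triples in the order encountered. States and
    # actions here are random integers purely to exercise the update; the
    # real simulator would encode the cards in the trick and the card
    # played, with reward -1 for taking hearts and +1 at the end of a game
    # in which the agent did not take the most hearts.
    return [(random.randrange(100), random.randrange(13),
             random.choice([-1, 0])) for _ in range(13)]

for game in range(10000):      # "repeat forever", truncated in practice
    episode = generate_episode()
    seen = set()
    for i, (state, action, _) in enumerate(episode):
        if (state, action) in seen:
            continue           # first-visit: only the first occurrence counts
        seen.add((state, action))
        R = sum(r for _, _, r in episode[i:])   # return after first occurrence
        returns[(state, action)].append(R)
        values[(state, action)] = (sum(returns[(state, action)])
                                   / len(returns[(state, action)]))
    for state, _, _ in episode:
        # greedy policy improvement over the actions seen so far in this state
        known = [(a, v) for (s, a), v in values.items() if s == state]
        policy[state] = max(known, key=lambda av: av[1])[0]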

This algorithm allowed us to build a look-up table for all the different possible state-action pairs when three cards were present in a trick. Estimated look-up tables for the cases when two cards or one card were present in the trick were then constructed from the three-card look-up table, using a "visual memory" of the previously played cards to constrain the number of possible states and actions. Heuristic methods based on the frequencies and probabilities of the previously played cards and suits were used to play the first card of a trick when the agent had won the previous trick and had to lead. Thanks to all the cogbox people for discussions, and to the spike-based computation group for finding such a challenging task and for the fun during the testing phase of the different cogbox modules on the last night!

Spiking primitives

'Ravi Shekhar'

Confidence State Method, a.k.a. Ganglia box

'mrapson'

The aim of this cognitive module was to develop a system that could play advanced strategies in a way that could be implemented with spike-based computations.

Hearts is a complex game in which the optimal play often contains an element of risk. As an example, take a holding of the queen and five of diamonds. When the first diamond is led, the player with these cards needs to decide between playing the queen and playing the five. Since the queen is a high card and the suit is short, there is a risk in playing either card. If the queen is taken early, it is likely that the player will be forced to lead, and there is always a chance that the queen of spades (worth 13 points) may be discarded under the queen of diamonds. If instead the five is played, there is a chance that the queen can be discarded before diamonds are led again. However, if this does not happen, the player is forced to play the queen on the next diamond trick. The risk of a player being void and able to discard a point card grows with each successive diamond trick. Furthermore, it is harder to find a good lead when there are fewer tricks left in the hand.

The example above illustrates some of the risks that need to be balanced in determining an optimal play for a given situation. A typical approach to evaluating the risks associated with each play would be to store a buffer of all cards played and who played them; this buffer could then be examined during each trick to determine an optimal play. However, this approach would be expensive to implement using spike-based computation; furthermore, human players are able to learn to play Hearts to a reasonable proficiency without being able to recall all the cards that have been played.

Based on the experience of learning to play Hearts, it is hypothesized that human players primarily keep track of relatively few key pieces of information about the game: which suits players are void in, how many hearts are outstanding, whether the queen of spades has been played, and which cards in the hand are high and low. The Confidence State Method focuses on recording the state of the game by updating a set of weights representing the player's opinion of the strength of each card in its hand.

Before the first card is played, the weights are initially set to default values read from a file. These initial values represent the opinion a human player would form of a card as they pick up and sort their hand, and they could be learned by the system over the course of many hands and games. The values of the cards in the hand are then adjusted based on the whole hand. This process mimics a human player noting the strengths and weaknesses of the hand and evaluating how to handle it.

As each card is played, the system changes its evaluation of the strength of each card in its hand. It also changes a few global variables that represent the player's cautiousness (or nervousness) and aggressiveness in its style of play. To determine a card to play at each turn, the weighting of the cards in the hand is adjusted for cautiousness/aggressiveness and the legal card with the maximum value is played. This final step of selecting the maximum-value play could be implemented using Timmer Horiuchi's neural network approach.
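As an illustration of this final selection step, a minimal sketch follows; the risk term and the form of the cautiousness adjustment are assumptions for illustration, not the tuned rules of the module:

# Hypothetical sketch: pick the legal card whose adjusted weight is largest.
# hand_weights maps each card to its current strength estimate;
# cautiousness in [0, 1] damps the weight of risky (high) cards.

def risk(card):
    value, _suit = card
    return value / 13.0        # hypothetical: higher cards are riskier

def select_card(hand_weights, legal_cards, cautiousness):
    def adjusted(card):
        return hand_weights[card] * (1.0 - cautiousness * risk(card))
    return max(legal_cards, key=adjusted)   # legal card with maximum value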

On each turn a Hearts player is either leading, following suit, or discarding a card when they cannot follow suit. The current implementation works well for leading and discarding; however, following suit (in general the hardest case, as the example above illustrates) is still problematic. In the remainder of this discussion, the implementation for following suit is described, and selected results from a hand played against the Microsoft Hearts program are presented to show the evolution of card values during the course of a hand.

When following suit it is generally best to play the highest card in the hand that will not win the trick. However, this is sometimes not possible, and sometimes it is worth taking a trick to make a specific lead or to avoid getting stuck with the lead later in the hand. When following suit only the relative values of cards in the suit are relevant, and they increase with increasing face value (excluding the queen of spades, of course). These values evolve as cards are played, but relatively slowly. To calculate the value of playing each card on the current trick, the card values are multiplied by a soft threshold; see figure 1. The function is currently based on a hyperbolic tangent, and the location of the threshold is selected based on the card currently winning the trick. As the figure shows, the shape depends on how cautiously we are playing, which in turn depends on the state of the game and the cards in the trick so far. The soft threshold function could be improved: there is often a tension between playing a card slightly higher than the one currently winning (hoping another player will be forced to win the trick) and playing the highest card in the suit, and this tension is not captured well by this function.
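The exact parameterization of the threshold is not reproduced here, but a tanh-based soft threshold of the kind described might look like the following sketch, where winning_value is the face value of the card currently winning the trick and the constants are illustrative:

import math

def soft_threshold(card_value, winning_value, cautiousness):
    # Multiplier applied to a card's value when following suit. Close to 1
    # for cards below the card currently winning the trick (safe plays) and
    # falling smoothly toward 0 for cards that would win it; a higher
    # cautiousness gives a sharper drop. Constants are illustrative only,
    # not the tuned values of the actual module.
    steepness = 1.0 + 4.0 * cautiousness
    return 0.5 * (1.0 - math.tanh(steepness * (card_value - winning_value)))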

Figure 1: Soft threshold function used to weight card values when following suit. This threshold favors playing the highest value card that will not win the trick, however cards that might win the trick may still be played. The likelihood of playing a winning card when other plays are available is set by the cautiousness/nervousness and the threshold is displayed for values of 0.1 to 0.9.

To illustrate the evolution of the card values with time, the output from the cogbox for two tricks during a game against Microsoft Hearts is given:

Trick 1, showing original values. dv - discarding value, fv - following value, lv - leading value.

Following nervousness, leading aggression:
0.80000
0.00000 0.90000 0.90000 0.90000 0.90000

local hand (card, suit), dv, fv, lv:
2.000000 1.000000 0.102913 0.140000 0.632179
4.000000 1.000000 0.139667 0.190000 0.595424
6.000000 1.000000 0.198475 0.270000 0.536617
9.000000 1.000000 0.330791 0.450000 0.404301
12.000000 1.000000 0.588074 0.800000 0.147018
13.000000 1.000000 0.676285 0.920000 0.058807
1.000000 2.000000 0.089387 0.128000 0.608950
2.000000 2.000000 0.104751 0.150000 0.593587
7.000000 2.000000 0.237435 0.340000 0.460903
8.000000 2.000000 0.279335 0.400000 0.419002
10.000000 2.000000 0.384086 0.550000 0.314252
11.000000 2.000000 0.467886 0.670000 0.230451
13.000000 2.000000 0.642470 0.920000 0.055867

I PLAY: Ace of Spades Number: 13. Suit: 1.

Trick 4, showing modified values; zeros in the first two columns represent cards that have already been played.

Following nervousness, leading aggression:
0.88200
0.00000 0.84733 0.90000 0.88209 0.84733

local hand (card, suit), dv, fv, lv:
2.00000 1.00000 0.13791 0.18761 0.49484
4.00000 1.00000 0.18271 0.24249 0.48549
0.00000 0.00000 0.20840 0.28350 0.51515
9.00000 1.00000 0.40256 0.49613 0.37260
0.00000 0.00000 0.68158 0.84000 0.14114
0.00000 0.00000 0.67628 0.92000 0.05881
1.00000 2.00000 0.08939 0.12800 0.60895
2.00000 2.00000 0.10475 0.15000 0.59359
7.00000 2.00000 0.23743 0.34000 0.46090
8.00000 2.00000 0.27933 0.40000 0.41900
10.00000 2.00000 0.38409 0.55000 0.31425
11.00000 2.00000 0.46789 0.67000 0.23045
13.00000 2.00000 0.64247 0.92000 0.05587

I PLAY: 10 of Spades Number: 9. Suit: 1.

This approach provides a means of implementing a cogbox that is able to make calculated trade-offs without requiring perfect memory. The implementation could be improved by careful balancing of the weights used - ideally allowing some to be learned. There is also scope for modifying the implementation to copy further aspects of human play.

Thank you to Steve Kalik for useful discussions about this idea.

Hand estimates

Steve Kalik

THE GOAL

The goal of this project was to help evaluate the likelihood of taking the trick when playing any given card from a player's own hand, given the player's position in the trick sequence and the likelihood of seeing other cards from the deck played on this trick by the other players.

PROBABILITY FUNCTION FORM

In general, the likelihood of seeing a particular card played in a given trick is:

-Zero if it has already been played in another trick.

-Dependent upon the Strategy of the player holding it.

-Dependent upon the value of the card to carrying out the Strategy.

-Dependent upon the other cards available in the player's hand with which to achieve the Strategy.

-Dependent upon the likelihood that the player will maintain the expected Strategy in this trick.

-Dependent upon the likelihood that the card will be necessary to achieve a particular strategy in a later trick.

In general, the likelihood of winning or not winning any particular trick is:

-Better known depending upon the number of cards left to be played in the trick

-Better known depending upon how extreme the value is compared to the values of cards remaining in the deck

SIMPLIFIED VERSION OF THE RULES:

In the version of Hearts played at the 2010 workshop, and specifically in the interest of making progress on the project in the short period of time allocated to work on it, certain standard parts of the game of Hearts were not included in game play. These simplifications are already described in Frederic Broccard's section, above. They are:

Not assigning any points to the Queen of Spades card,

Not providing a bonus for "Shooting the Moon", and

No exchanging of cards prior to play.

IMPACT OF SIMPLIFIED RULES AND SHORT DEVELOPMENT TIME:

Working quickly within this simplified version influenced the selection of probabilities of play in the following ways:

  • Increased probability that players were trying to win as few tricks as possible
  • Increased probability that players sought to win as few tricks containing hearts as possible
  • Increased confidence that players would stick with these strategies throughout the game.
  • No side information available from card exchange.

An example system that followed this strategy explicitly was Coggie, described above (see Timmer Horiuchi's description).

In breaking down the likelihood of seeing cards played, we began with the simplest possible case, in which all players choose cards to play at random. Before any cards are dealt, all players are equally likely to receive each card, giving a 1/4 chance of any player holding any given card, with 13 of the 52 cards (again 1/4) dealt to each player. This is shown in figure 1, below.


Figure 1: Uniform probability of being dealt each card from the deck

Once the cards are dealt, the players can separate what they know (their own cards and any cards they have seen played so far) from what they must estimate (which cards are in the other players' hands, and which players are holding which cards). What is known with certainty goes either up to 1 if the card is known to be held, or down to 0 if it is known not to be held. Cards not held can be known from looking in one's own hand (if you have it, they can't have it) and from looking at what has already been played (if someone already played it, it can't be played again). For the remaining 3/4 of the cards not in one's own hand initially, the probability that they are in any other player's hand is 1/3 (one chance for each of the other players' hands, with 13 of the remaining 39 cards in each player's hand).

Raising the probability of the player's own cards to a value of 1 is shown in figure 2 for the hand of a particular player; the probability of holding any other card is set to zero. (Note: this function could then be filtered by the probability of being passed a card in the swap before play starts, if that rule is reinstated in future versions of this project.)


Figure 2: Cards held are indicated by probability of 1, cards not held are set to a probability of 0.

Once the probabilities of holding each card in one's own hand are resolved as described and shown in figure 2 above, each player can use that information to estimate the cards held in the other players' hands. Figure 3 demonstrates how this can be represented in a simplified form.


Figure 3: Estimates of one's own hand for each player, and of the hands of others given that player's knowledge, assuming they remember every card played. Each square is a single player's set of estimates of the likelihood that each player holds each card. Titles indicate which square is which player's set of estimates, e.g. the upper left square shows player 1's estimates. The horizontal axis represents each card in order of card value [2 .. J Q K A] for a given suit, restarting the cycle for the next suit and continuing through all four suits. The vertical axis within a square represents each player whose card holdings are being estimated (four rows for the four players, with player 1's holdings across the top row and player 4's holdings across the bottom row). Red indicates cards held with 100% certainty (probability 1). Dark blue indicates cards known not to be held (probability 0). The color bar indicates the estimated probability that the player is holding other cards. In the case shown here, with random selection of cards, the probabilities are all 1/3 when they are not 1 or 0.
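A small numpy sketch of how these estimate tables can be built and updated under the assumptions of the figures (perfect memory of played cards, uniform beliefs otherwise; function names are ours):

import numpy as np

N_PLAYERS, N_CARDS = 4, 52

def initial_estimates(my_index, my_hand):
    # (players x cards) matrix of P(player holds card) from one player's
    # point of view, as in figure 3: own cards are certain (1), cards we
    # hold cannot be elsewhere (0), and every unseen card is in each other
    # player's hand with probability 1/3.
    est = np.full((N_PLAYERS, N_CARDS), 1.0 / (N_PLAYERS - 1))
    est[my_index, :] = 0.0
    for card in my_hand:
        est[:, card] = 0.0
        est[my_index, card] = 1.0
    return est

def observe_play(est, card):
    # A played card is certainly gone from every hand (figure 4). Under the
    # random-play assumption, the identity of the player who played it adds
    # no further information.
    est[:, card] = 0.0
    return est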

The result of a sequence of played tricks is shown in figure 4, demonstrating the updating of these tables for each player after each trick. It assumes each player selects cards randomly, without even worrying about playing by the rules, and therefore no marginal information about other players' cards is gained from their plays, except for the removal of cards that have been seen played from the set of remaining possibilities.


Figure 4: A sequence of played cards, as tricks, although they do not necessarily constitute legal plays in this case. Each individual subfigure follows the format of figure 3, and the sequence of tricks goes down each column and then back up to the top of the next column. By the last trick, it is virtually certain which 4 cards will be played, but the players still do not know which other player has which card.

This model of tracking which cards have been played provides some improvement in estimating what can be played, but without assumptions about the players' strategies it does not do very well at estimating what will be played until the end of the hand. In the interest of helping future groups that might work on this problem or one similar in nature, the next section points to future work that could build on this memory structure to iteratively infer strategies from a small set of candidates, which in turn could improve, or at least bias, the estimates of what cards a player holds based on estimates of what they should play if they are following one strategy or another.

FUTURE WORK: IMPROVING THE ESTIMATES

As described previously, the strategy a player pursues should influence their choice of cards to play. In the simplest terms, we will think of strategy as selecting a card for a particular trick. For any particular trick, there are really just two possible primary strategies: to Win the trick, or to Lose the trick. Typically the goal is to lose the trick, to avoid taking cards and to avoid the risk of taking point cards. In this discussion we will ignore the probability of each strategy for now, and focus simply on how to play to achieve the strategic goal for a particular trick.

Let us assume the case where a player's goal is to lose the trick. If the player plays the lowest card, they will surely lose. (For the purposes of this discussion, any card outside of the lead suit is automatically low enough to lose, so we simply count it as below the lowest card in the suit.) Figure 5 shows an estimate of the probability of winning the trick with any particular card played. Note that there may be some finite probability that other players will not have cards in the suit, so the probability of the lead player winning the trick can never be zero, while it can be zero for players who are void in the suit that is led. For this example, we will assume that this is the first trick in the suit, i.e. the first time any cards from that suit have been played, and that cards are distributed with uniform probability, so each player can be expected to have cards in the suit. In such a case, the probability of winning the trick increases with each increase in the face value of the card after the (n-1)th lowest card, where n is the number of players holding cards in the led suit. Also, for the purposes of this discussion, we will refer to the players as players f, g, h and k, playing their cards in that order.


Figure 5: Probability of winning the trick with the play of a given lead card. Note that the lead player always has an increased chance to win because they are the only player who cannot, by definition, play out of suit.

Once the first player selects a card to play, all cards with values below that card will lose the trick with equal, 100% certainty. Thus the probability of losing can be known better than the probability of winning, because only the single highest card wins, while any card lower than the highest played will lose. In the example shown here, imagine that the first player, player f, plays an 8. This skews the probability that the other players will win, and can be applied in their decision-making process by filtering the cards they hold with a bias curve that rises sharply at the location of the played card but continues to grow with each successively higher-valued card that could be chosen. (Note: this is similar to the model proposed by 'mrapson' in the previous cogbox results section, with several differences. His aggressiveness factors would be applied after the probability of winning was explicitly determined here; his function is smooth and relatively stable above and below the threshold, where this version continues to increase the likelihood of winning the trick as the value of the card goes up; and the function shown here (probability of winning) is the opposite of his, in the sense that it is the opposite of the desirability of playing the card if the strategic goal is to lose the trick being played. Also, his system would select a card at the output, while this one provides a background function that could then be filtered through an appropriate analog of his function to drive the final selection of which card to play.) The discussion in this section will not include the aggressiveness factor he describes, but it would be great to see the two integrated at some future time.
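A sketch of how such a bias curve could be computed under the uniform random-play model used above (the function name and arguments are ours):

def win_probability(card_value, highest_played, unseen_suit_values, n_to_play):
    # Hypothetical bias curve: P(winning the trick) for a card played now.
    # unseen_suit_values lists the face values of the led suit not yet seen;
    # n_to_play is the number of players still to play after us. Assumes each
    # of them plays a uniformly random unseen card of the suit and is never
    # void, so the curve is zero up to the current winner, jumps sharply just
    # above it, and keeps growing with face value.
    if card_value <= highest_played:
        return 0.0                      # cannot beat the current winner
    higher = sum(1 for v in unseen_suit_values if v > card_value)
    p_stay_below = 1.0 - higher / max(len(unseen_suit_values), 1)
    return p_stay_below ** n_to_play    # nobody overtakes us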


Figure 6: After player f plays his card, the selections of player g can be filtered according to the image shown. Also notice that the number of cards left to play in the trick has been reduced by 1, as has the number of cards available for g, h, and k to play.

As player g now selects a card that will lose relative to player f's, two hypotheses can be put forward: either player g had a strategic goal to lose this trick, or player g did not have the appropriate cards to achieve the goal of winning it. Weight can be added to the probability of both of these, for comparison against future tricks in the hand, to update estimates of the cards g holds in this suit and the strategy g is trying to play. In figure 7 we see g's selection, in this case a card below f's, and the minimal effect this has on the filtering function describing h's chance of winning the trick, which would then be overlaid against h's hand.


Figure 7: The card g chooses is below f's, leaving f still at risk of winning, and assuring g that he will lose this trick.

With g playing a card below f's, there is essentially no effect on the immediate context for h's card selection in this trick. However, h may use the information gained so far to select cards to play and to hold, in order to improve his chance of achieving both the current goals for this trick and longer-term goals like winning or losing some or all of the remaining tricks in the hand. Figure 8 shows an example of h's selection given this context. Player h can estimate that there will be some increased risk in playing a card above f's, but cannot know for certain whether he will win the trick until k plays his card. The other players can now also add weight to two hypotheses: 1. that h might have wanted to win the trick, and 2. that if h wanted to lose the trick, he didn't have cards that would let him meet that goal.


Figure 8: The card chosen by h was above f's, but not so large as to assure winning. The play of this third card completely divides the space into sharp decisions for k: cards below h's will lose the trick, and cards above h's will win the trick, both with no uncertainty.

In the example shown in figure 8, the outcome of the trick is now completely in k's hands. If k plays a card above h's, k will win the trick and lead the next trick. If k plays below h's card, k will lose the trick and will play second in the next trick. This can be known with 100% certainty, so in some sense the risk is lower, and a higher card can be played safely by k in a way that was not possible for h, who is now likely to win the trick as a result of the relatively high card he played. Figure 9 shows the result of k's decision.


Figure 9: k decides to play a surprisingly low card, indicating that he either wanted to lose the trick or had no cards high enough to win it. However, given the range of cards that k could have played while still losing the trick, the playing of this extremely low card suggests that there is additional valuable information in that play: either k does not have higher cards in that suit, or he did not want to use them, suggesting that k may intend to play high, and perhaps to win a trick or two, in the future.

Now that the trick is complete, its effect can be seen in the shape of the probability of winning for a new lead card played in the same suit as the last trick. Figure 10 shows how the previously played cards have altered the probability curve for winning the trick, and how this has raised the mean value of the remaining cards in the suit, so that higher cards can be played while maintaining the same probability of winning the trick as in the prior trick.


Figure 10: Note the skewed curve of probabilities for cards played within the same suit used in the last trick (compare to figure 5). The four cards already played cannot be played again in this hand. Although not shown here, additional estimates can be made by applying this cue to the likely card-availability curves, after the card-availability estimates have been modified by filtering them with the strategy hypotheses indicated by the previous trick.

The approach shown here follows some of 'mrapson''s ideas, in that the key hypotheses about players' goals and available cards are proposed to be estimated and propagated forward, rather than the values of the specific cards they played. This seems parsimonious, in that it carries information about ranges rather than individual values, but more thought on the most efficient way to do this may yet be necessary.

INCLUSION OF SHOOTING THE MOON

One last comment on the simplification of not allowing the "shoot the moon" play: this impacts predominantly the high-level strategy, allowing players to concentrate on never having to win a trick. It reduces the uncertainty in what the other players are trying to do, and in what strategy any individual player needs to follow. However, it also removes some relatively simple probabilistic decision making, which might provide a useful problem for a small future project once other pieces are in place. For example, the mechanism guiding shooting-the-moon related decisions need not be overly complicated. A simple logical model that could do the trick is the following:

If you have all of the cards you need to win all the tricks, try to shoot the moon.

If anyone else has already won a point, don’t try to shoot the moon.

If two other people have each won a point card (a heart card), you don’t have to win any tricks, and neither does anyone else anymore.

If one person has won half or more of the point cards, try to win at least one point card to avoid letting them shoot the moon.

This can be constructed from fairly simple logic, and can drive strategy selection to generate interesting modes of play.
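For illustration, the four rules above translate directly into a few lines of code; a hypothetical sketch (names and the point-count encoding are ours, assuming the 13 hearts are the only point cards):

def moon_policy(can_win_every_trick, points_won_by_others):
    # Hypothetical encoding of the four rules above; points_won_by_others
    # holds the point (heart) count taken so far by each other player.
    if can_win_every_trick:
        return 'try_to_shoot'            # rule 1: the hand can win everything
    scorers = [p for p in points_won_by_others if p > 0]
    if len(scorers) >= 2:
        return 'relax'                   # rule 3: no one can shoot the moon now
    if any(p >= 7 for p in points_won_by_others):
        return 'block'                   # rule 4: half or more of 13 points -
                                         # win a point card yourself
    if scorers:
        return 'dont_shoot'              # rule 2: someone else has a point
    return 'avoid_points'                # default Hearts strategy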

Many thanks to 'mrapson' and 'Chris Rozell' for fruitful discussions about the probabilistic filtering and the goals in playing Hearts strategically. Also thanks to Andreas Andreou for additional discussions about probabilistic reasoning, and to Timmer Horiuchi, 'Ravi Shekhar' and 'Tobias Glasmacher' for discussions on the layout of neural circuits and networks to carry out the variety of selection and computation processes required to run the basic CogBox and to search through the cards to select ones to play from the players' hands given the context of the cards already played in the trick.

Attachments