Pattern Recognition in Stock Market Models
Why do I show a marine food web? Because in the stock market model I would like to identify analogous money-webs: how money flows through the exploitation of one trader by another. These webs are dynamic. I would like to know how they change over time, whether there are a small number of 'flux modes' in these money-webs, and how these modes vary.
Unsupervised learning to cluster directed 'exploitation' money-webs in stock market models.
We are interested in unsupervised learning because we have no a priori training signal that defines categories. The first step is to form an R-way partition. The partition separates all data points (each a description of a directed exploitation graph) into mutually exclusive and exhaustive subsets. The second step is to design a classifier based on the labels assigned to the training patterns by the partition (Nilsson, 1996).
Alternatively, one may use hierarchical clustering algorithms to partition the space into a tree, e.g. in taxonomic or phylogenetic analysis.
Most methods cluster on the basis of a distance measure between data points (patterns). For numeric patterns this may be as simple as a Euclidean distance in an n-dimensional space. Knowing this, an important step in formulating our problem is to define a distance measure between directed subgraphs of a directed graph. The simplest measure is the overlap of node sets, ignoring edge properties. Let us use this for now, and develop refined measures later.
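This node-overlap distance could be sketched in Python as follows (a hypothetical illustration, not the code used in the analysis; each subgraph is represented simply as a set of node identifiers):

```python
def node_overlap_distance(g1, g2):
    """Distance between two subgraphs based purely on node overlap.

    Each graph is given as a set of node identifiers; edges are ignored.
    Returns 1 - |intersection| / |union| (a Jaccard-style distance), so
    identical node sets give 0 and disjoint node sets give 1.
    """
    nodes1, nodes2 = set(g1), set(g2)
    if not nodes1 and not nodes2:
        return 0.0
    return 1.0 - len(nodes1 & nodes2) / len(nodes1 | nodes2)
```

Two graphs sharing half their nodes, e.g. {1, 2, 3} and {2, 3, 4}, come out at distance 0.5 under this measure.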
Using this measure a dissimilarity matrix can be constructed and a hierarchical method applied, e.g. an agglomerative algorithm that joins the pair that is most similar and creates new distances based either on the minimum or the maximum of the distances from the members of the pair to the other data points. These are the single-link method and the complete-link method respectively (see Chapter 9, "Statistical Pattern Recognition", Webb). Another variant is the sum-of-squares method, an agglomerative algorithm that joins data points on the basis of minimizing the total within-group sum of squared distances. Clustering criteria for sum-of-squares methods are based on the covariance matrix, from which the within-class scatter matrix (pooled within-group scatter matrix) can be calculated. The trace of this matrix (the sum of its diagonal elements) can then be minimized, which is equivalent to minimizing the total within-group sum of squares.
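The single-link rule can be sketched in a few lines of Python (an illustrative toy, not the implementation referenced in Webb; the threshold-based stopping rule is an assumption of this sketch):

```python
def single_link_clusters(dist, threshold):
    """Minimal single-link agglomerative clustering sketch.

    dist is a symmetric matrix (list of lists) of pairwise dissimilarities.
    Clusters are repeatedly merged while the smallest inter-cluster
    distance, measured between the closest pair of members (the
    single-link rule), is below threshold. Returns a list of clusters,
    each a list of point indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the single-link rule.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] >= threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The complete-link variant would replace the inner `min` with `max`, so that a merge is only as close as its most distant pair of members.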
Alternatively, quick partitions could be used, e.g. leader clustering. Mixture methods assume the data are composed of a weighted sum (with mixing proportions) of g probability distributions, each depending on a parameter vector. The task is to find g, the mixing proportions, and the parameter vector of each distribution.
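Leader clustering is simple enough to sketch directly (a hypothetical Python illustration; the distance threshold is a parameter of this sketch, not something fixed by the method):

```python
def leader_clustering(points, threshold):
    """Quick-partition ('leader') clustering sketch.

    The first point becomes a leader; each subsequent point joins the
    nearest existing leader within threshold, otherwise it becomes a new
    leader itself. A single pass over the data, so very fast but
    order-dependent.
    """
    leaders = []
    assign = []
    for p in points:
        best = None
        for k, q in enumerate(leaders):
            d = sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
            if d <= threshold and (best is None or d < best[0]):
                best = (d, k)
        if best is None:
            leaders.append(p)          # p starts a new cluster
            assign.append(len(leaders) - 1)
        else:
            assign.append(best[1])     # p joins its nearest leader
    return assign, leaders
```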
A popular non-hierarchical method is the K-means algorithm, which partitions the data into k clusters that minimise the within-group sum of squares. Each object is assigned to the group with the closest mean; new group means are then calculated from this assignment; the algorithm terminates when no movement of an object to another group will reduce the within-group sum of squares. Variants of the algorithm allow creation of new clusters and deletion of old ones.
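The assign/update loop can be sketched in Python as follows (illustrative only; initializing the means from the first k points is an arbitrary choice of this sketch):

```python
def kmeans(points, k, iters=100):
    """Minimal K-means sketch for n-dimensional points (lists of floats).

    Starts from the first k points as means, then alternates an
    assignment step and a mean-update step until assignments stop
    changing (or iters is exhausted).
    """
    means = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with closest mean.
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                means[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, means
```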
Self-organizing Maps are another technique.
Some useful tutorials are available online. For example...
Market Time-Series Data
1st June 2007. The behaviour over a long run is shown below. Price is shown every 100 trades in red and dividend in black, over the course of 106300 trading events starting at the 186400th event. Note that the price is unstable, even after many trades. This makes us suspicious about whether the program is actually working correctly. We need to visualize many features of the software to make sure it is behaving as we expect. One possibility is that the instability arises because the conditional variance of the agent is calculated on the basis of the accuracy of the best active rule rather than the mean accuracy of all the active rules [Try changing this and see what happens.] Although the price remains at around 25, it fluctuates massively.
The wealth of agents can be plotted over the course of the trial, see below. The spread of the wealth distribution increases. One agent loses everything, whereas other agents gain. There seems to be no trend towards increasing mean wealth. Most agents' wealth remains fixed, implying that they rarely take part in trading.
The agents' stock holdings can also be visualized over time. Each agent holds one share at the beginning. Note that the holdings of agents change very rapidly. For example, the green agent sometimes holds everything and sometimes owns nothing. This is bizarre behaviour. We should perhaps introduce a term which forbids bids or offers over a certain proportion of the total number of shares. Many agents have no holdings throughout the run. This is best seen in the ListDensityPlot shown below.
Agents appear along the x-axis and timesteps go up along the y-axis. Note that agents 4, 6 and 10 appear to hold most stock near the beginning of the experiment, whereas agents 6 and 9 hold most stock near the end of the experiment. How does stock holding correlate with agent wealth?
The above graph shows wealth against holdings for all agents over the time period shown above. Note that all agents started with a wealth of 20000 and a holding of 1. We see that, to a first approximation, there is a positive relationship between holdings and wealth.
Winning Agent Features
What are the features possessed by the winning agent? We want to understand what tactics the richest agent is using. To this end we have recorded the properties of the wealthiest agent every 100 timesteps. We can examine how these properties are changed by the genetic algorithm. We run the whole simulation another time, this time calculating the conditional variance for each agent's prediction of price on the basis of the mean accuracy of all active rules rather than the best accuracy. The maximum trade size per agent per round is also reduced to 5 shares. The same graphs as above are shown for this second trial.
More agents have some ownership of shares compared with the previous run. No agent ever comes to own all the shares. There are still large price fluctuations, usually associated with a change in the 16-bit pattern set. A change of the pattern set by just one bit can result in large fluctuations of price. The identity of the wealthiest agent over the trial is shown below. (Time is shown as samples at 100-ts intervals.)
We see that 6 agents hold the title of wealthiest agent over the course of the trial after the first 50000 ts. The wealth of the richest agent over time is shown below. There is a tendency for this to increase to a plateau of approximately 24000.
The holdings by the best agent are shown below. Note that they fluctuate much more than the wealth.
The expected price by the best agent is shown below. This is interesting. Most of the time the expected price of the shares is 0. We don't allow the predicted price of any one rule to be less than zero. However, it is clear that the expected price is not at all tracking the real price most of the time.
The conditional variance is shown below. The variance is normally very low, even reaching zero. Occasionally it increases to high levels. This is the mean accuracy of the currently active rules of the best agent. Where there are no active rules, there is no mean accuracy (nan). The times when there are no active rules presumably correspond to the times when the expected price is zero.
The best desired stock holding over time is shown below (black) along with best bids (red) and best offers (green). Offers and bids tend to be made rarely, and at around the same time periods that correspond to increased desired stock holdings. These are obviously periods where rules are active.
The a values of the best predictors are shown below. The graph suggests that rule diversity in the head of the agent might be being lost. When a rule has high accuracy, it can completely replace the other rules in the head. The agent may lose the capacity to respond to a wide variety of conditions, simply because in SOME conditions rule i was very good and so took over the population during that time. An explicit test for rule diversity is required. Note that there seems to be more diversity in a than in b: a takes values both above and below 1, whereas b takes negative values on the whole.
The graph of b is interesting, because the best agent's b values are all negative. That is, the linear predictor that is used tends to predict a linear loss. Why should the b part of the predictor be negative in this way?
The predicted price by each rule of the best agent at time t is shown below.
The accuracy of each rule used by the best agent at time t is shown below.
There is only a weak tendency for rule accuracy of the best agent to increase over time. The rules are shown below for a selection of time points. There are 100 rules (y-axis), each containing 16 conditions (x-axis). Rules start out random, blue = 2, red = 0, green = 1. We see that rules tend to show greater similarity to each other over time, since they arise from being copied. No indication is given of how active the rules are. It is very difficult to interpret this diagram.
The rules are ordered so that the ones at the top of each grid are on the whole the most accurate.
Unsupervised Classification of Exploitation Webs
The interaction between agents can be visualized in an agent x agent matrix, or directed graph. An arrow from agent A to B exists if A makes money and B loses money in a given time window W. If A and B both gain or both lose money, no edge exists. This defines an 'exploitation graph' showing correlations between money being gained and lost between agents. The figure below shows how the exploitation matrix can look qualitatively different at different times in the run.
[Qualitatively different looking exploitation graphs at different points in the run]
We want to understand how these exploitation graphs differ over time. Do they oscillate? Do they have multiple stable attractors? Are there cliques? Do they form a small-world network or a scale-free network? Are there cyclic exploitations, or tree-like exploitations?
We start to do this in as unbiased a way as possible by using unsupervised learning algorithms to cluster feature descriptions of the exploitation graphs. The exploitation graphs for time window W = 100 are stored in text files downloadable here (eM.txt, eMF.txt). Mathematica is used to process these files (download .m file here). The Mathematica code for defining dissimilarity between two such exploitation graphs is shown below.
Four graph descriptions are read into the function. The first two are matrices of exploitation values Eij, defined as (gain of wealth by A × loss of wealth by B) / (absolute change in wealth of A + absolute change in wealth of B). If this value is negative it is set to zero: if both A and B change wealth in the same direction the value is negative, and no exploitation is said to have occurred. The second two matrices define the direction of exploitation (dEij); they contain a 1 at each site where gain in A - gain in B > 0. This means that when Eij is positive, if dEij = 1 then A exploits B, and if dEij = 0 then B exploits A. dEij has no meaning if Eij = 0.
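The definition above can be sketched in Python as follows (a hypothetical reimplementation of the described formulas, not the actual Mathematica function; `dw`, the vector of per-agent wealth changes over the window W, is an assumed input):

```python
def exploitation_matrices(dw):
    """Sketch of the exploitation matrices described above.

    dw[i] is agent i's change in wealth over the time window W.
    E[i][j]  = (gain of i * loss of j) / (|dw[i]| + |dw[j]|), clipped
               at zero, so same-direction wealth changes give E = 0.
    dE[i][j] = 1 where dw[i] - dw[j] > 0 (direction of exploitation);
               dE is only meaningful where E is positive.
    """
    n = len(dw)
    E = [[0.0] * n for _ in range(n)]
    dE = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            denom = abs(dw[i]) + abs(dw[j])
            if denom > 0:
                e = (dw[i] * -dw[j]) / denom  # gain of i times loss of j
                E[i][j] = max(e, 0.0)         # negative => no exploitation
            dE[i][j] = 1 if dw[i] - dw[j] > 0 else 0
    return E, dE
```

Note that E itself is symmetric under this definition (the product of a gain and a loss is the same either way round), which is exactly why the separate direction matrix dE is needed.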
The above Mathematica function counts the number of dissimilar edges, where a dissimilar edge is one where one graph exhibits exploitation between A and B and the other does not, or exhibits exploitation in the reverse direction between A and B. The dissimilarity matrix in its raw form is visualized below for some sampled interaction matrices from the end of the trial. A Matlab function has been written to do the same thing, download here. We see that a given exploitation matrix does seem to resemble matrices that arise at distant times as well as matrices at closely spaced timesteps. There are discontinuous rapid changes in distance, suggesting a rapid flipping back and forth between at least two different attractors.
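The edge-counting dissimilarity can be sketched in Python like this (a hypothetical reimplementation of the Mathematica/Matlab functions, operating on Eij and dEij matrices as described above, with E symmetric):

```python
def graph_dissimilarity(E1, dE1, E2, dE2):
    """Count dissimilar edges between two exploitation graphs.

    An agent pair (i, j) contributes a dissimilar edge if one graph
    exhibits exploitation between i and j and the other does not, or if
    both do but in opposite directions.
    """
    n = len(E1)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):          # each undirected pair once
            p1, p2 = E1[i][j] > 0, E2[i][j] > 0
            if p1 != p2:
                count += 1                 # exploitation in only one graph
            elif p1 and dE1[i][j] != dE2[i][j]:
                count += 1                 # present in both, but reversed
    return count
```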
Mathematica or Matlab (pdist, linkage, dendrogram) can be used to do hierarchical clustering, although in this case this approach is not terribly informative, see figure top right. In fact there don't seem to be very distinct clusters here. The cophenet function determines the correlation between distances in the dissimilarity matrix and distances in the clustering tree. It should be 1 if clustering has worked, but here it is only 0.495.
Multidimensional scaling can be used to visualize clusters in this dissimilarity matrix. We do this by using PRAAT, which is designed for speech analysis. A file in PRAAT-readable format is written by Mathematica. All other operations are done from within PRAAT to generate various MDS views with ease. Another approach is to use Matlab (mdscale, see here).
We see there are clusters, sometimes with rapid oscillations. Oscillations between clusters are visible in the diagram below.
Earlier we wrote that we wanted to understand how these exploitation graphs differed over time. Do they oscillate? When we 'join the dots' we see that there is an oscillation between 4 clusters at the corners of the space. This oscillation is likely to be due to a rapid oscillation between red and green, i.e. X exploits Y, followed shortly afterwards by Y exploiting X. What are the features of the interaction matrices at each cluster, i.e. what sort of features does the exploitation graph oscillate between? Let us simply observe what the interaction matrix looks like in the 4 clusters. Unfortunately, this is not very helpful either. There are many data points that have no exploitation, and yet these are found all over the place, because they will inevitably be close to other data points with no exploitation. All graphs show that only one or two agents are exploiting each other in a 100-trade window. There is really rather little exploitation occurring in the market. The meaning of the 'oscillations' is now not very clear. We are not confident in the meaningfulness of this clustering: it seems to capture nothing that I can translate into intuition.
As a final check before dismissing this clustering approach, it would be useful to do this clustering over a larger set of data points over the course of the entire run. We do this below.
[Clustering over the whole run].
We also compare the above run with a run in which rule evolution rate is increased to K = 250.
[Clustering over entire run when learning rate is K = 250.]
Perhaps scaling the dissimilarity by the sum of edges present in both graphs would help to make the dissimilarity measure independent of graph density?
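That proposed scaling might look like the following Python sketch (hypothetical; it divides the dissimilar-edge count by the total number of exploitation edges present across the two graphs, so that sparse and dense graphs become comparable):

```python
def scaled_dissimilarity(E1, dE1, E2, dE2):
    """Density-normalized variant of the dissimilar-edge count.

    Returns (number of dissimilar edges) / (total edges present in the
    two graphs combined), or 0.0 if neither graph has any edges.
    E matrices are symmetric exploitation values; dE matrices give the
    direction of exploitation.
    """
    n = len(E1)
    dissimilar = 0
    edges = 0
    for i in range(n):
        for j in range(i + 1, n):          # each undirected pair once
            p1, p2 = E1[i][j] > 0, E2[i][j] > 0
            edges += int(p1) + int(p2)
            if p1 != p2:
                dissimilar += 1            # edge in only one graph
            elif p1 and dE1[i][j] != dE2[i][j]:
                dissimilar += 1            # edge in both, reversed
    return dissimilar / edges if edges else 0.0
```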
Making a Feature Vector Out of Exploitation Graph Properties
Next we look explicitly at the graph properties of exploitation graphs, and in doing so we construct a feature vector that allows exploitation graphs to be described, and hence dissimilarity to be determined in, we hope, a more useful way than the measure used above. We record various properties of the exploitation graph, such as:
1. Number of nodes, number of edges.
2. Draw the graph, showing the richest agents as large nodes, and fluxes as edge direction and thickness.
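Property 1 can be computed directly from the exploitation matrix; a hypothetical Python sketch of the start of such a feature vector:

```python
def graph_features(E):
    """Illustrative start of a feature vector for an exploitation graph.

    E is the symmetric matrix of exploitation values Eij. Returns
    [number of participating nodes, number of exploitation edges],
    where a node participates if it has at least one positive edge.
    """
    n = len(E)
    edges = sum(1 for i in range(n) for j in range(i + 1, n) if E[i][j] > 0)
    nodes = sum(1 for i in range(n) if any(E[i][j] > 0 for j in range(n)))
    return [nodes, edges]
```

Further graph properties (cliques, cycles, degree distribution) could be appended to this vector in the same way.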