At the end of a typical experiment we have a list of sequences, and we wish to know which of these complete sequences (or which motifs embedded in these sequences) are likely to have been created by template replication rather than by de novo generation by elongation mechanisms. However, standard statistical methods for doing this may be misleading. Why? It is important to realize that de novo generation by elongation mechanisms is NOT the same as random synthesis of sequences. This is partly because:
i. Elongation mechanisms may favour the assembly of particular sequences because of stacking effects.
ii. The initial population of dimers may be non-random, so that assembly processes tend to produce repeats.
iii. Even template mediated elongation processes can copy sequence information, as long as this sequence information can be released to undertake further template mediated elongation reactions.
In fact this concept is SO exciting that we should examine closely whether it is possible for mechanisms of 'self-assembly elongation' to blend into a means of 'distributed replication', and how this can be done. Consider the following simplified model of template replication vs. elongation. Consider four kinds of sequence primitive, a, b, c and d (these may represent dimers, for example). Start the system with equal numbers of these sequence primitives. Assume the following operations acting on these sequence primitives...
First consider the operations that do not transmit existing sequence information but generate new sequence information. Such processes may be able to mediate attractor based heredity if coupled with a template producing metabolism.
1. Elongation by Spontaneous Ligation of sequences: Pairs of sequences are joined together at rates determined by the prefix and postfix primitives of each sequence, via the corresponding entry of the cross-product SxS, where S is the vector of sequence primitive types. This produces non-randomly distributed nucleotide pair motifs, biased towards motifs possessing high rate constants in the SxS matrix. If the chemical composition of nucleotides remains fixed then the SxS matrix will not change and there will be no tendency for this bias to be altered through 'evolutionary time'. However, if it is possible for a process of variation (metabolic flexibility) to produce slightly different nucleotide types (sequence primitives) then the SxS matrix can itself grow and be altered. The SxS matrix may merely experience drift. However, if it is possible for the change of the SxS matrix to be controlled by the sequences themselves, then this may result in a recursive system in which increasingly effective self-assembling sequences can become stable. The details of a possible (but rather abstract) feedback system capable of some attractor based heredity, utilizing sequences undergoing self-assembly, are shown in the figure below. The figure shows a metabolism producing a range of sequence primitives S, which undergo self-assembly processes to form a sequence population. Sequences in the population are able to influence metabolism to produce sequence primitives at various rates. Sequence primitives differ from each other in their preference for other primitives with which to self-assemble.
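As a rough illustration of the ligation step, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the name `RATE` for the SxS matrix, the numerical rate values, the helper `ligate_once`, and the rule that the rate depends on the last primitive of the first sequence and the first primitive of the second.

```python
import random

PRIMITIVES = "abcd"

# Hypothetical SxS rate matrix: RATE[x][y] is the relative rate of joining a
# sequence that ends in primitive x to one that starts with primitive y.
# All values are illustrative assumptions, not measured rate constants.
RATE = {x: {y: 1.0 for y in PRIMITIVES} for x in PRIMITIVES}
RATE["a"]["b"] = 4.0  # assume 'a' followed by 'b' ligates four times faster


def ligate_once(population, rng):
    """Join one ordered pair of sequences, chosen in proportion to its rate.

    The population must contain at least two sequences.
    """
    pairs = [(i, j)
             for i in range(len(population))
             for j in range(len(population)) if i != j]
    weights = [RATE[population[i][-1]][population[j][0]] for i, j in pairs]
    i, j = rng.choices(pairs, weights=weights, k=1)[0]
    joined = population[i] + population[j]
    rest = [s for k, s in enumerate(population) if k not in (i, j)]
    return rest + [joined]


rng = random.Random(1)
population = list("aabbccdd")  # equal numbers of each primitive
for _ in range(5):
    population = ligate_once(population, rng)
print(population)
```

Enumerating all ordered pairs is quadratic in the population size, so a fast simulation would need a cleverer sampling scheme; the sketch only shows how a fixed SxS matrix biases which pair motifs form.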

2. Synthesis of new primitives: Assume that monomers can be incorporated, producing new sequence primitives, e.g. a new 'a' or a new 'b'. However, it is extremely unlikely that monomer incorporation will be able to copy more than one sequence primitive in a single step, e.g. producing a new 'ab'. Instead we assume that new primitives can only be synthesized one at a time. The distribution of these new primitives will not be random but will be a function of the number of exposed sequence primitives already in existence and the affinity of sequence primitives for forming templates. The number of exposed sequence primitives may itself be a function of the composition of sequences within the reactor. At any given temperature, shorter templates will be more likely to be single stranded and so to serve as a template for new monomer formation. Thus, a reasonable simplification is to make the generation of a new primitive a decreasing function of the length of the templates possessing that primitive, e.g. aaaaaa is less likely to be involved in templating a new a than bb is to template a new b. This is because aaaaaa is more likely to be double stranded. The affinity of sequence primitives depends on the metabolic organization, which for now we assume is fixed.
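One way to express this simplification is sketched below. The power-law form of `exposure`, the exponent `alpha`, and the function names are assumptions chosen for illustration; any decreasing function of template length would serve the same purpose.

```python
def exposure(length, alpha=2.0):
    """Assumed chance that a site on a template of this length is exposed."""
    return length ** -alpha


def templating_weights(population, alpha=2.0):
    """Relative rate at which each primitive is templated by the population."""
    weights = {}
    for seq in population:
        for primitive in seq:
            # Every occurrence is a potential templating site, discounted by
            # how likely a template of this length is to be single stranded.
            weights[primitive] = weights.get(primitive, 0.0) + exposure(len(seq), alpha)
    return weights


weights = templating_weights(["aaaaaa", "bb"])
print(weights["a"] < weights["b"])
```

With `alpha` greater than 1 the six sites on aaaaaa together contribute less than the two sites on bb, which matches the example in the text. With `alpha` equal to 1 the two would contribute equally, so the exponent matters.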
Second consider the operations that do transmit existing sequence information and so may be able to mediate modular heredity.
3. Templated Elongation by Monomer Attachment: Existing templates can be elongated by monomer attachment. Again, this depends on the number of exposed sites at which this can happen, and the affinity of the monomer to attach at that site. This is a means of transmitting existing sequence information since a template must have been involved in the formation of this bond.
4. Templated Elongation by Sequence Attachment: Again, this will depend on the number of exposed sites. This is influenced by the number of staggered splint junctions that a sequence is capable of being involved in, as it is this type of structure that allows this kind of elongation. Information can be transmitted in this process since a template determines the sequences that can form. The probability of such an event will thus depend on the presence of complementary trimolecular complexes. The calculation of the probability of formation of such trimolecular complexes is not a trivial task, but we can make an abstraction here too.
Third consider the operations that may allow elongating templates to break into shorter sections that are again active.
5. Cleavage: Assume that cleavage is a function of sequence length, but also a function of sequence composition. Certain sequences are more likely to break than other sequences. In the absence of any prior knowledge about what these sequences are, I decide on a subset of sequences capable of cutting themselves at their extremities.
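A minimal sketch of such a cleavage rule follows. The set `SELF_CLEAVING`, the saturating form of the length dependence, and the rate parameter `k` are all hypothetical choices made for the example, since the text does not fix them.

```python
import math
import random

# Hypothetical subset of motifs able to cut themselves off a sequence end.
SELF_CLEAVING = {"ab", "cd"}


def maybe_cleave(seq, rng, k=0.05):
    """Return the fragments left after one cleavage attempt on seq."""
    for motif in sorted(SELF_CLEAVING):
        if len(seq) <= len(motif):
            continue
        if seq.startswith(motif):
            fragments = [motif, seq[len(motif):]]
        elif seq.endswith(motif):
            fragments = [seq[:-len(motif)], motif]
        else:
            continue
        # Assumed length dependence: longer sequences break more often.
        p = 1.0 - math.exp(-k * len(seq))
        return fragments if rng.random() < p else [seq]
    return [seq]  # no self-cleaving motif at either extremity


print(maybe_cleave("abccdd", random.Random(0)))
```

Sequences without one of the chosen motifs at an extremity never break in this sketch, which is a stronger assumption than the text requires.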
What next?
The next step is to model the above processes, one by one, in a fast model, capable of generating a large number of sequences and exploring the feedback processes that can influence the exploration of sequence space. A simulation much faster than my previous template replication model is necessary for this to be achieved.
How to Measure the Extent of 'Heredity' or 'Convergent Synthesis of Motifs' in a Population of Sequences.
The sequence population consists of motifs, some of which may have been copied or may have undergone convergent synthesis (i.e. a self-assembly process tending to produce non-random sequences). Irrespective of the mechanism involved, we would like to have a method that can detect motifs that are likely to exist in the population by non-random processes, i.e. that are unlikely to exist in such a copy number by a purely random process. Motifs may not be copied perfectly, and so the method we use for detecting copies must take account of the possibility of errors in either self-assembly or replication.
To see how to do this we must learn some bioinformatics. The motif finding problem is discussed by Jones and Pevzner and by many others. We want to find the alignment matrix of all sequences in the population that maximizes the consensus score. An example of the consensus score is shown below. It is the sum, over the columns of a motif of length L, of the frequency of the most common nucleotide in each column.
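For a given alignment the consensus score is straightforward to compute. The sketch below assumes the motif instances have already been cut out of the sequences and aligned, so that they all have the same length L; the function names are mine.

```python
from collections import Counter


def consensus_score(aligned):
    """Sum over columns of the count of the most common symbol."""
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*aligned))


def consensus_string(aligned):
    """Most common symbol in each column (ties broken by first occurrence)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


instances = ["ACGT", "ACGA", "TCGT"]
print(consensus_string(instances), consensus_score(instances))
```

For these three instances the column maxima are 2, 3, 3 and 2, giving a score of 10 out of a possible 12. The hard part of the motif finding problem is the search over starting positions, which this sketch does not attempt.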
A similar problem is the Median String Problem. This is based on the Hamming distance between two strings, i.e. the number of positions at which they differ, given a particular alignment. For a candidate string v of length L, the TotalDistance is the sum, over all sequences in the population, of the minimum Hamming distance between v and a substring s of that sequence, taken over all possible starting positions of s. The problem is to find the median string, i.e. the string v that minimizes the TotalDistance.
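A brute-force version of the median string search can be sketched as follows. It tries every string of length L over the alphabet, so it is only practical for short motifs; the function names and the example sequences are made up for illustration, and every sequence is assumed to be at least L long.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(u, v))


def total_distance(v, sequences):
    """Sum over sequences of the best Hamming match of v at any start."""
    L = len(v)
    return sum(min(hamming(v, s[i:i + L]) for i in range(len(s) - L + 1))
               for s in sequences)


def median_string(sequences, L, alphabet="ACGT"):
    """Exhaustively search all strings of length L for the lowest TotalDistance."""
    candidates = ("".join(p) for p in product(alphabet, repeat=L))
    return min(candidates, key=lambda v: total_distance(v, sequences))


seqs = ["TTACGTT", "GACGAGG", "CCCACGT"]
best = median_string(seqs, 3)
print(best, total_distance(best, seqs))
```

The search costs on the order of 4^L evaluations for a nucleotide alphabet, which is why branch-and-bound variants are normally used for longer motifs.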
We would like to know the probability of finding a motif with this consensus score given that the strings were randomly generated, i.e. how unlikely it would be to obtain this consensus score by chance. Alternatively, we would like to know how unlikely it is that a given median string of length L has such a low TotalDistance by chance.
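One simple way to approach this question is by Monte Carlo simulation, sketched below for the TotalDistance. The null model here is uniform random strings of the same lengths, which is an assumption made for illustration; as argued at the start of this section, de novo elongation is not the same as random synthesis, so a more realistic null would have to be substituted. The function names, the example sequences and the number of trials are also mine.

```python
import random
from itertools import product


def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))


def total_distance(v, sequences):
    L = len(v)
    return sum(min(hamming(v, s[i:i + L]) for i in range(len(s) - L + 1))
               for s in sequences)


def best_total_distance(sequences, L, alphabet):
    """Lowest TotalDistance achieved by any string of length L."""
    return min(total_distance("".join(p), sequences)
               for p in product(alphabet, repeat=L))


def empirical_p_value(sequences, L, alphabet="ACGT", trials=200, seed=0):
    """Fraction of random populations whose best score is at least as low."""
    rng = random.Random(seed)
    observed = best_total_distance(sequences, L, alphabet)
    hits = 0
    for _ in range(trials):
        # Assumed null: same number and lengths of sequences, uniform symbols.
        null = ["".join(rng.choice(alphabet) for _ in s) for s in sequences]
        if best_total_distance(null, L, alphabet) <= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids p = 0


seqs = ["GGACGTTTCA", "ACGTCCATGG", "TTGACGTAAC",
        "CCCTTACGTG", "GATACGTCCA", "TGCAACGTTA"]
p = empirical_p_value(seqs, 4, trials=50)
print(p)
```

The search for the best string is repeated on every random population, because comparing a motif found by search against the chance score of a single fixed string would overstate its significance.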
A problem with these kinds of algorithms is that, although they tolerate substitutions, they do not take the possibility of insertions or deletions into account. BLAST and other local alignment algorithms can do this, but it is necessary to give the algorithm a desired target sequence that one is trying to find alignments with. This is not what we want. We want to know ALL the motifs (of all different lengths) embedded in the population, and their divergence from random, but taking into account the probability of substitutions, insertions and deletions. A simple program can be written that considers copies that differ only by substitutions.
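A simple program of this substitution-only kind might look like the Python sketch below. It counts, for every motif of length k seen in the population, how many windows lie within a given number of substitutions of it. The function names and the `max_subs` parameter are assumptions, and insertions and deletions are not handled.

```python
from collections import Counter


def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))


def windows(sequences, k):
    """All substrings of length k, with repeats, across the population."""
    return [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]


def approximate_copy_numbers(sequences, k, max_subs=1):
    """For each observed k-mer, count windows within max_subs substitutions."""
    exact = Counter(windows(sequences, k))
    return {motif: sum(n for other, n in exact.items()
                       if hamming(motif, other) <= max_subs)
            for motif in exact}


population = ["abab", "abad"]
for k in (2, 3):  # scan several motif lengths
    print(k, approximate_copy_numbers(population, k))
```

The comparison is quadratic in the number of distinct k-mers, and the copy numbers still need to be compared against a null expectation before any motif can be called non-random.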
Mathematica is suitable for some string manipulation algorithms. See an excellent tutorial here by Brian Higgins from UC Davis. My attempt to write a motif search algorithm using Mathematica is downloadable here.
