Representational Trajectories in Connectionist Learning

Andy Clark
School of Cognitive and Computing Sciences, University of Sussex
Brighton, BN1 9QH
CSRP 292

Abstract

The paper considers the problems involved in getting neural networks to learn about highly structured task domains. A central problem concerns the tendency of networks to learn only a set of shallow (non-generalizable) representations for the task, i.e. to 'miss' the deep organizing features of the domain. Various solutions are examined, including task-specific network configuration and incremental learning. The latter strategy is the more attractive, since it holds out the promise of a task-independent solution to the problem. Once we see exactly how the solution works, however, it becomes clear that it is limited to a special class of cases in which (1) statistically driven undersampling is (luckily) equivalent to task decomposition, and (2) the dangers of unlearning are somehow being minimized. The technique is suggestive nonetheless, for a variety of developmental factors may yield the functional equivalent of both statistical AND 'informed' undersampling in early learning.

Key words: Connectionism, learning, development, recurrent networks, unlearning, catastrophic forgetting.

0. An Impossible Question.

Some questions are just impossible. They are impossible because the question is asked in a kind of vacuum which leaves crucial parameters unspecified. The customer who goes into a shop and demands "How much?" without specifying the item concerned poses just such an impossible question. So too, I shall argue, does the philosopher or cognitive scientist who asks, baldly, "What kind of problem solving comes naturally to a connectionist system?" Yet just such a question seems close to the surface of many recent debates. (For especially clear examples, see the opening paragraphs of most papers offering hybrid (connectionist/classical) cognitive models; e.g. Cooper and Franks (1991), Finch and Chater (1991).)

It is not my purpose, in what follows, to argue that there are no generic differences between connectionist and classical approaches, nor that such differences, carefully understood, might not indeed motivate a hybrid approach. The goal is rather to demonstrate, using concrete examples, the wide range of parameters which need to be taken into account before deciding, of some particular problem, whether it is or is not a suitable case for connectionist treatment. The main claim of the paper is that the question of what comes 'naturally' to connectionist systems cannot be resolved by reflection on the nature of connectionist architecture and processing alone. Instead, a variety of (at least superficially) disparate parameters all turn out to be fully interchangeable determinants of whether or not a network approach will succeed at a given task. Such parameters include (1) the overall configuration of the network (e.g. number of units, layers, modules etc.), (2) the nature of the input primitives, (3) the temporal sequence of training, and (4) the temporal development of the network configuration (if any). Some of these parameters are obvious enough, though insufficient attention is paid to them in actual debates (e.g. (1) and (2)). Others are less obvious but, as we shall see, no less important.

The strategy of the paper is as follows. I begin (section 1) by detailing cases in which networks fail to solve a target problem. In section 2 I rehearse a variety of different ways in which such shortcomings may be remedied.
These remedies involve tweaking the various neglected parameters mentioned above. I go on (section 3) to show how these rather specific instances bear on a much bigger issue, viz. how cognitive development depends not just on the learning device and data, but on the further 'scaffolding' provided by course of training, selective attention, sensorimotor development, linguistic surround and other factors.

1. Net Failures.

Sometimes the failure of a system is more instructive than (would have been) its success. A case in point is Norris's (1990, 1991) attempt to use a multi-layer, feedforward connectionist network to model date calculation as performed by 'idiot savants'. These are people who, despite a low general intelligence, are able to perform quite remarkable feats of specific problem solving. In the date calculation case, this involves telling you, for almost any date you care to name (e.g. November 20 in the year 123,000!), what day of the week (Monday, Tuesday etc.) it falls on. The best such date calculators can successfully perform this task for dates in years which I can hardly pronounce, the top limit being about the year 123470. Norris conjectured that since idiot savant date calculators are solving the problem despite low general intelligence, it might be that they are using what he describes as a 'low level learning algorithm', such as backpropagation of error in a connectionist net (see Norris (1991), p.294). The task, however, turned out to be surprisingly resistant to connectionist learning. Norris initially took a 3-layer network and trained it on 20% of all the day/date combinations in a fifty-year period. The result was uninspiring. The network learned the training cases by rote, but failed to generalise to any other dates. Perhaps the fault lay with some simple aspect of the configuration? Norris tried permutations of numbers of units, numbers of layers of hidden units etc. To no avail. Is the date calculation problem therefore one which connectionist models are not 'naturally equipped' to solve? The issue is much more complex, as we shall see.

Here is a second example. Elman (1991-a) describes his initial attempts to persuade a 3-layer network to learn about the grammatical structure of a simple artificial language. The language exhibited grammatical features including verb/subject number agreement, multiple clause embedding and long distance dependencies (e.g. of number agreement across embedded clauses). The network belonged to the class of so-called 'recurrent' networks and hence exploited an additional group of input units whose task is to feed the network a copy of the hidden unit activation pattern from the previous processing cycle alongside the new input. In effect, such nets have a functional analogue of short-term memory: they are reminded, when fed input 2, of the state evoked in them by input 1, and so on. The task of the net was to take as input a sentence fragment and to produce as output an acceptable successor item, i.e. one which satisfies any grammatical constraints (e.g. of verb/number agreement) set up by the input. Alas, the Elman net, too, failed at its task. It failed completely to generalise to new cases (i.e. to deal with inputs not explicitly given during training) and got only a badly incomplete grip on the training cases themselves. The network had failed to learn to use its resources (of units and weights) to encode knowledge of the deep organising features of the domain, such as whether or not an input was singular or plural.
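To fix the architecture in mind, here is a minimal sketch of the copy-back mechanism that gives such a 'recurrent' net its functional short-term memory. It is written in Python/NumPy; the layer sizes, one-hot word coding and random initialisation are invented for illustration, and no claim is made to reproduce Elman's actual implementation or training procedure.

# A minimal sketch of an Elman-style simple recurrent network (SRN).
# All sizes, codings and initialisations here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 26, 20, 26               # e.g. one unit per vocabulary item
W_xh = rng.normal(0, 0.1, (n_hidden, n_in))      # input -> hidden weights
W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden weights
W_hy = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output weights

def step(x, context):
    """One processing cycle: the context units feed back a copy of the
    previous hidden-unit activation pattern alongside the new input."""
    hidden = np.tanh(W_xh @ x + W_ch @ context)
    output = W_hy @ hidden                       # scores over possible successor words
    return output, hidden                        # the hidden pattern becomes the next context

# Process one 'sentence' of one-hot coded words.
context = np.zeros(n_hidden)                     # empty short-term memory at sentence start
for word_index in [3, 11, 7]:                    # arbitrary token indices
    x = np.zeros(n_in)
    x[word_index] = 1.0
    output, context = step(x, context)

Training such a net (by backpropagating the error between these output scores and the actual next word) is the procedure which, in Elman's first experiments, failed to uncover features such as number agreement.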
(A failure of this kind can be demonstrated by the use of techniques such as principal components analysis; see Gonzalez and Wintz (1977), Elman (1991-b).) The result was a network whose internal representations failed to fix on the grammatical features essential to solving the problem. Does this kind of grammar therefore represent another problem space which connectionist learning (even using the relatively advanced resources of a recurrent network) is fundamentally ill-suited to penetrate? Once again, the issue is considerably more complicated, as we shall now see.

2. How to Learn the Right Thing.

One way of solving a learning problem is, in effect, to give up on it. Thus it could be argued that certain features are simply unlearnable, by connectionist means, on the basis of certain bodies of training data, and hence that the 'answer' is either to give up on connectionist learning (for that task) or to build more of the target knowledge into the training data in net-accessible ways. Thornton (1991) suggests that networks can only learn features which are present in the first-order statistics of a training set. First-order statistics cover e.g. the number of occurrences of an item in a set, while second-order ones cover e.g. the number of items which have some first-order statistical frequency in the set, i.e. they are statistics about statistics. The input description language, if this is right, must always be such that the target knowledge is statistically first order with respect to it. Hence very often the solution to a learning failure will be to alter the input description language. This is an interesting claim, and one pursued further in Clark (forthcoming). Nonetheless, the input description language, although it no doubt could be manipulated to solve many instances of network failure, need not always be tampered with. In the present section I examine a variety of ways of dealing with the failures described in section 1 which keep the training corpus (and hence the input description language) fixed, but instead manipulate one or other of a variety of different, often neglected, parameters. The goal will be to show first, that such corpus-preserving fixes are possible; second, that the various fixes are pretty much interchangeable (and hence, I believe, that the roles of architecture, input code and training sequence are not fundamentally distinct in connectionist learning); and third, that the problems (from section 1) and the fixes described have more in common than meets the eye. By the end of the section it should be increasingly clear just how complex the question of what comes naturally to a connectionist system really is.

The first fix I want to consider is by far the most brutal. It involves pre-configuring the network in a highly problem-specific way. Recall Norris's unsuccessful attempt to model date calculation. To generate a successful model, Norris reflected on the logical form of a particular date calculation algorithm. The algorithm involved three steps. First, day/date pairings are specified (by rote) for a base month (say, November 1957). Second, offsets are learnt to allow the generalisation of the base month knowledge to all the other months in that year (1957). Finally, offsets between years are learnt (i.e. a one-day offset between consecutive years, modulo leap years). With this algorithm in mind, Norris chose a global configuration comprising 3 distinct sub-nets, one of which would be trained on each distinct sub-task (i.e.
base month modelling, base year transformations, and cross-year transformations). Each sub-net was trained to perform its own specific part of the task (in logical sequence) and learning in it was stopped before training the next sub-net. Thus learning was stopped in sub-net 1 before training sub-net 2, and so on. Sub-net 2 would take output from sub-net 1 and transform it as needed, and sub-net 3 would take output from 2 and do the same. The upshot of this pre-configuration and training management was, perhaps unsurprisingly, a system capable of solving the problem in a fully generalizable manner. The final system was about 90% accurate, failing mainly on leap-year cases of the same kind as cause problems for human date calculators (see Norris (1991), p.295).

This result is, however, at best only mildly encouraging. It shows that the problem space can be negotiated by connectionist learning. And true, the solution does not require amending the input description language itself. But the solution depends on a task-specific configuration (and training regime) which is bought only by drastic human intervention. Such intervention (assuming the goal is to develop good psychological models of human problem solving) is, as far as I can see, legitimate only if either (a) we can reasonably suppose that the long-term processes of biological evolution in the species have pre-configured our own neural resources in analogous ways, or (b) we are assuming that the configuring etc. can be automatically achieved, in individual cognitive development, by processes as yet unmodelled. In the case at hand, (a) seems somewhat implausible. The real question that faces us is thus whether there exist fixes which do not depend on unacceptable kinds of human intervention. Recent work by Jeffrey Elman suggests that the answer is a tentative 'yes' and that the key lies in (what I shall label) the scaffolding of a representational trajectory. Hence we move to our second fix, viz. manipulating the training.

Recall Elman's failed attempt to get a recurrent network to learn the key features of a simple grammar. One way of solving the problem is, it turns out, to divide the training corpus into graded batches and to train the network by exposure to a sequence of such batches, beginning with one containing only the simplest sentence structures and culminating with one containing the most complex ones. Thus the net is first trained on 10,000 sentences exhibiting e.g. verb/subject number agreement but without any relative clauses, long distance embeddings etc., and then it is gradually introduced to more and more complex cases. The introduction of the progressively more complex cases was gradual in two senses. First, insofar as it was accomplished by grading the sentences into five levels of complexity and exposing the net to example batches at each level in turn. And second (this will be important later on), because the network was 'reminded', at each subsequent stage of training, of the kinds of sentence structure it had seen in the earlier stages. Thus, for example, stage 1 consisted of exposure to 10,000 very simple sentences, and stage 2 consisted of exposure to 2,500 sentences of a more complex kind plus 7,500 (new) very simple cases. This 'phased training' regime enables the network to solve the problem, i.e. to learn the key features of the artificial language. And it does so without amending the basic architecture of the system and without changing the content of the corpus or the form of the input code.
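The shape of such a regime can be sketched in a few lines. Only the composition of the first two stages is taken from the figures just given; the make-up of the later stages, the placeholder sentence generator and the training stub are illustrative assumptions rather than a record of Elman's procedure.

# Sketch of a phased ('incremental') training schedule of the kind described above.
# Phase compositions beyond the first two, and all helper functions, are invented.
import random

def sample_sentences(level, n):
    """Stand-in for a generator of n sentences at a given complexity level
    (level 0 = simplest: agreement only, no relative clauses or embeddings)."""
    return [f"<level-{level} sentence {i}>" for i in range(n)]

# Each later stage mixes newly introduced complex cases with fresh simple cases,
# so the net is 'reminded' of the building-block structures at every stage.
phases = [
    sample_sentences(0, 10_000),                              # stage 1: simple only
    sample_sentences(1, 2_500) + sample_sentences(0, 7_500),  # stage 2: 25% more complex
    sample_sentences(2, 5_000) + sample_sentences(0, 5_000),  # stages 3-5: progressively
    sample_sentences(3, 7_500) + sample_sentences(0, 2_500),  #   richer mixtures
    sample_sentences(4, 10_000),
]

def train_one_pass(net, corpus):
    """Placeholder for one pass of training through the corpus."""
    random.shuffle(corpus)
    # ... present each sentence word by word, compare prediction with the actual successor ...
    return net

net = object()                    # stand-in for the recurrent net of the earlier sketch
for corpus in phases:
    for _ in range(5):            # several passes per stage
        net = train_one_pass(net, corpus)

In an actual implementation, train_one_pass would update the recurrent net's weights by backpropagation of the prediction error; here it is left as a stub.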
What makes the difference, it seems, is solely the sequential order of the training cases. Why should this be so effective? The answer, according to Elman, is that phasing the training allows the network to spot, in the early stages, the most basic domain rules and features (e.g. the idea of singular and plural and of verb/subject number agreement). Knowing these basic rules and features, the net has a much smaller logical space to search when faced with the more complex cases. It is thus able to 'constrain the solution space to just that region which contains the true solution' (Elman (1991-a), p.8). By contrast, the original net (which did not have the benefit of phased training) saw some very complex cases right at the start. These forced it to search wildly for solutions to problems which in fact depended on the solutions to other, simpler problems. As a result it generated lots of 'ad hoc' small hypotheses which then (ironically) would serve to obscure even the grammatical structure of the simple cases. Such a net is, in effect, trying to run before it can walk, and with the usual disastrous consequences.

At this point we begin to see a common thread uniting the grammar case and the date calculation case. Both domains require, in a very broad sense, hierarchical problem solving. In each case there is a problem domain which requires, for its successful negotiation, that a system de-compose the overall problem into an ordered series of sub-problems. In the grammar case, this involves first solving e.g. the verb/subject number agreement problem and only later attacking the problem of relative clauses. In the date calculation case it involves e.g. first solving the problem for the base year, and only later attacking the problem of other years. We can express the general moral, made explicit in Elman (1991-a), like this:

Elman's Representational Trajectory Hypothesis: There is a class of domains in which certain problem solutions act as the 'building blocks' for the solutions to other, more complex problems. In such domains connectionist learning is efficient only if the overall problem domain can somehow be de-composed and presented to the net in an ordered sequence. In the absence of such de-composition the basic regularities (the 'building blocks') are obscured by the net's wild attempts to solve the more complex problems, and the more complex problems are, practically speaking, insoluble.

The solution, as we have seen, is to somehow sculpt the network's representational trajectory, i.e. to force it to solve the 'building block' problems first. This can be achieved either by direct manipulation of the architecture and training (Norris) or by 'scaffolding' a network by the careful manipulation of the training data alone (Elman). The Norris solution, however, was seen to involve undesirable amounts of problem-specific human intervention. The phased training solution is a little better insofar as it does not require problem-specific pre-configuration of the architecture. The third and final fix I want to consider is one which involves neither phasing the training nor pre-configuring the architecture to suit the problem. It is what Elman calls 'phasing the memory', and it represents the closest approximation to the ideal of an automatic means of sculpting the representational trajectory of a network.
Recall that short-term memory, in the Elman network, is given by a set of so-called context units whose task is to copy back, alongside the next input to the net, a replica of the hidden unit activation pattern from the previous cycle. The 'phased memory' fix involves beginning by depriving the network of much of this feedback, and then slowly (as training continues) allowing it more and more until finally the net has the full feedback resources of the original. The feedback deprivation worked by setting the context units to 0.5 (i.e. eliminating informative feedback) after a set number of words had been given as input. Once again, there were five phases involved. But this time the training data was not sorted into simple and complex batches. Instead, a fully mixed batch was presented every time. The phases were as follows.

Phase 1: feedback eliminated after every 3rd/4th word (randomly).
Phase 2: feedback eliminated after every 4th/5th word (randomly).
Phase 3: feedback eliminated after every 5th/6th word (randomly).
Phase 4: feedback eliminated after every 6th/7th word (randomly).
Phase 5: full feedback allowed (i.e. the same as the original net used in the earlier studies).

In short, we have a net which, as Elman puts it, 'starts small' and develops, over time, until it reaches the effective configuration of the original recurrent net. This 'growing' network, although exposed to fully mixed sentence-types at all stages, is nonetheless able to learn the artificial grammar just as well as did the 'phased training' net. Why should this be so? The reason seems to be that the early memory limitations block the net's initial access to the full complexities of the input data; hence it is not tempted to thrash around seeking the principles which explain the complex sentences. Instead, the early learning can only target those sentences and sentence fragments whose grammatical structure is visible in a 4/5 word window. Unsurprisingly, these are mostly the simple sentences, i.e. ones which exhibit such properties as verb/subject number agreement but do not display e.g. long distance dependencies, embeddings, etc. The 'phased memory' solution thus has the same functional effect as the phased training, viz. it automatically decomposes the net's task into a well ordered series of sub-tasks (first agreement, then embeddings etc.). The key to success, we saw, is to somehow or other achieve task-decomposition. The great attraction of the 'phased memory' strategy is that the decomposition is automatic - it does not require task-specific human intervention (unlike e.g. the Norris solution or the phased training solution). It is always reassuring to learn that the use of limited resources (as in the net's early memory limitations) can bring positive benefits.

In suggesting a precise way in which early cognitive limitations may play a crucial role in enabling a system to learn about certain kinds of domain, Elman's work locates itself alongside E. Newport's (1988, 1990) studies concerning the explanation of young children's powerful language acquisition skills. It is worth closing the present section by summarizing this work, as it helps reveal important features of the general approach. Newport was concerned to develop an alternative explanation of young children's facility at language acquisition. Instead of positing an innate endowment which simply decays over time, Newport proposed what she labelled a "Less is More" hypothesis, viz.
that: "The very limitations of the young child's information processing abilities provide the basis on which successful language acquisition occurs" (Newport (1990), p.23). (Note: the "less is more" hypothesis remains neutral on the question of whether a significant innate endowment operates. What it insists on is that the age-related decline in language acquisition skills is not caused by the decay of such an endowment.) The key intellectual limitation which (paradoxically) helps the young child learn is identified by Newport as the reduced ability to "accurately perceive and remember complex stimuli" (op. cit., p.24). Adults bring a larger short term memory to bear on the task, and this inhibits learning since it results in the storage of whole linguistic structures which then need to be analysed into significant components. By contrast, Newport suggests, the child's perceptual limitations automatically highlight the relevant components. The child's reduced perceptual window thus picks out precisely the "particular units which a less limited learner could only find by computational means" (op. cit., p.25). Some evidence for such a view comes from studies of the different error patterns exhibited by late and early learners. Late learners (generally second language learners) tend to produce inappropriate "frozen wholes". Early learners tend to produce parts of complex structures with whole components missing. The idea, then, is that children's perceptual limitations obviate the need for a computational step in which the components of a complex linguistic structure are isolated. As a result: "Young children and adults exposed to similar linguistic environments may nevertheless have very different internal databases on which to perform a linguistic analysis" (Newport (1990), p.26). The gross data to which children and adults are exposed is thus the same, but the effective data, for the child, is more closely tailored to the learning task (for more on this theme, see Clark (forthcoming, Ch.9)). Newport's conjectures and Elman's computational demonstration thus present an interesting new perspective on the explanation of successful learning. This perspective should be of special interest to developmental cognitive psychology. In the next section I try to expand and clarify the developmental dimensions while at the same time raising some problems concerning the generality of the "phased memory" and "less is more" solutions.

3. The Bigger Picture: Scaffolding and Development.

Faced with a hierarchically structured problem domain, connectionist networks have, we saw, a distressing tendency to get 'lost in space(s)'. They try to solve for all the observed regularities at once, and hence solve for none of them. The remedy is to sculpt the network's representational trajectory; to force it to focus on the 'building block' regularities first. The ways of achieving this are quite remarkably various, as demonstrated in section 2. It can be achieved by direct configuration of the architecture into task-specific sub-nets, or by re-designing the input code, or by fixing the training sequence, or by phasing the memory. In fact, the variety of parameters whose setting could make all the difference is, I believe, even larger than it already appears. To see this, notice first that the mechanism by which both the Elman solutions work is undersampling. The network begins by looking at only a subset of the training corpus.
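In computational terms, the undersampling in the 'phased memory' case comes from nothing more than a schedule for wiping the context units. The sketch below shows one way such a schedule might look; the reset value and window sizes follow the description given earlier, but the update function, sizes and toy corpus are invented placeholders, not Elman's code.

# Sketch of the 'phased memory' regime: informative feedback is eliminated by
# resetting the context units to 0.5 after a small number of words, and the
# window widens phase by phase. All details other than the schedule are invented.
import random
import numpy as np

N_HIDDEN = 20
PHASE_WINDOWS = [(3, 4), (4, 5), (5, 6), (6, 7), None]    # None = full feedback

def toy_update(x, context):
    """Placeholder for one processing cycle of the recurrent net sketched earlier."""
    return np.tanh(0.1 * x + 0.1 * context)

def present_sentence(words, window):
    """Feed one sentence to the net, wiping short-term memory on the phase's schedule."""
    context = np.zeros(N_HIDDEN)
    limit = None if window is None else random.randint(*window)
    count = 0
    for word in words:
        context = toy_update(word, context)
        # ... prediction and weight update would happen here ...
        count += 1
        if limit is not None and count == limit:
            context = np.full(N_HIDDEN, 0.5)              # eliminate informative feedback
            limit, count = random.randint(*window), 0

# The same fully mixed corpus is used at every phase; only the memory window changes.
toy_corpus = [np.random.rand(length, N_HIDDEN) for length in (3, 8, 12)]
for window in PHASE_WINDOWS:
    for sentence in toy_corpus:
        present_sentence(sentence, window)

Because anything whose structure spans more than the current window is effectively invisible to the net, the early phases sample mainly the simple sentences, just as the phased training regime did by explicit sorting.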
But actual physical growth (as in the incremental expansion of the memory) is not necessary in order to achieve such initial undersampling, even supposing that no interference with the training corpus (no sorting into batches) is allowed. The heart of the 'phased memory' solution is not physical growth so much as progressive resource allocation. And this could be achieved even in a system which had already developed its full, mature resources. All that is required is that, when the system first attends to the problem, it should not allocate all these resources to its solution. In the Elman net, the memory feedback was initially reduced by setting the context units to 0.5 after every 4/5 words. A similar effect would be obtained by adding noise after every 4/5 words. Even switching attention to a different problem would do this, since relative to the grammar problem, the new inputs (and hence the subsequent state of the context units) would be mere noise. Rapidly switching attention between two problems might then have the same beneficial effect as phasing the overall memory. Limited attention span in early infancy might thus be a positive factor in learning, as might the deliberate curtailing of early efforts at problem solution in adult cognition. In general, it seems possible that one functional role of salience and selective attention may be to provide precisely the kind of input filter on which the phased memory result rests. (In the case of learning a grammar, it is worth wondering whether the fact that a young child cares most about the kinds of content in fact carried by the simple sentences may play just such a functional role, i.e. the child's interests yield a selective filter which results in a beneficial undersampling of the data. There is a kind of 'virtuous circle' here, since what the child can care about will, to an extent, be determined by what she can already understand.) Less speculatively, Elman himself notes (personal communication) that there are mechanisms besides actual synaptic growth which might provide a physical basis for early undersampling, e.g. delays in cortical myelinization resulting in high noise levels along the poorly myelinated pathways. A further developmental factor capable of yielding early undersampling is the gradual development of physical motor skills. This provides us with a staged series of experiences of manipulating our environment, with complex manipulations coming after simple ones. Once again, it may be that this automatic phasing of our learning is crucial to our eventual appreciation of the nature of the behaviour of objects (i.e. to the development of a 'naive physics' - see e.g. Hayes (1985)). Going deeper still, it is worth recalling the functional role of undersampling. The role is to enable the system to fix on an initial set of weights (i.e. some initial domain knowledge) which serve to constrain the search space explored later when it is faced with more complex regularities. As Elman puts it, "the effect of early learning ... is to constrain the solution space to a much smaller region" (Elman (1991-a), p.7) - i.e. to one containing fewer local minima. Given this role, however, we can see that at least three other factors could play the same role. They are (1) the presence in the initial system of any kinds of useful innate knowledge, i.e. (in connectionist terms) any pre-setting of weights and/or pre-configuration of networks which partially solves the problems in a given domain.
(Obviously this solution will be plausible only in central and basic problem domains, e.g. vision and language acquisition); (2) the set of basic domain divisions already embodied in our public language. The idea here is that in seeking to solve a given problem we will often (always?) deploy knowledge of types and categories - knowledge which is in part determined by the types and categories dignified by labels in whatever public language we have learnt (Plunkett and Sinha's (1991) picture of language as a 'semantic scaffold' captures much of what I have in mind); and finally, (3), the whole thrust of culture and teaching, which is, in a sense, to enable the communal sharing of 'building block' knowledge and hence to reduce the search space of the novice. The catalogue of speculations could be continued, but the effective moral is already clear. It is that attention to the basic mechanisms highlighted by the Elman experiments reveals a unifying thread for a superficially disparate bag of factors which have occupied cognitive developmental psychology since time immemorial (well, 1934 at least). What we need to understand, before we venture to pronounce on what connectionist networks will or won't be able to learn, is, it seems, nothing less than how cognitive development is 'scaffolded' by innate knowledge, culture and public language, and how broadly maturational processes and processes of individual learning (including selective attention) interrelate. Connectionism and developmental psychology are thus headed for a forced union, to the benefit of both parties.

That, however, is the good news. I want to close this section by looking at the downside and highlighting two limitations on Elman-style solutions. The first is a limitation which afflicts any 'phased memory' approach. It is that phasing the memory can only be effective in cases where, as a matter of fact (i.e. 'as luck would have it'!), merely statistically-driven undersampling of a training corpus is equivalent to task de-composition. Thus it so happens, in the artificial grammar domain, that an initial 4/5 word window isolates the set of training data necessary to induce the basic 'building block' rules of the domain. But as the old refrain goes, it ain't necessarily so. There are many domains in which such unintelligent undersampling of the data would not yield a useful subset of the corpus. It is here that reflection on Newport's work can help clarify matters. For suppose we ask: Is the posited "fit" between the young child's perceptual window and the componential structure of natural language just a lucky coincidence? In the case of natural language, the answer is plausibly "no". For, as Newport (op. cit., p.25) notes, it is tempting to suppose that the componential form of natural language has itself evolved so as to exploit the kind of perceptual window which the young child brings to bear on the learning task. Morphology may have evolved in the light of the young child's early perceptual limitations. Natural language may thus present a special kind of domain in which the learning problem has been posed precisely with an eye to the short-cut computational strategies made available by the early limitations. The "phased memory"/"less is more" style of solution is thus quite plausible for any cases in which the problem may itself have been selected with our early limitations "in view".
But its extension beyond such domains is questionable, as it would require a lucky coincidence between the componential structure of a target domain and the automatic windowing provided by early perceptual limits.

The second limitation is one which threatens to afflict the whole connectionist approach. It is the problem of unlearning or 'catastrophic forgetting' (French (1991)). Very briefly, the problem is that the basic power of connectionist learning lies in its ability to buy generalisation by storing distributed representations of training instances superpositionally, i.e. using overlapping resources of units and weights to store traces of semantically similar items. (For the full story, see Clark (1989).) The upshot of this is that it is always possible, when storing new knowledge, that the amended weights will in effect blank out the old knowledge. This will be a danger whenever the new item is semantically similar to an old one, as these are the cases where the net should store the new knowledge across many of the same weights and connections as the old. Vulnerability of old knowledge to new knowledge sets such networks up for a truly (Monty) Pythonesque fate, viz. exposure to one 'deadly' input could effectively wipe out all the knowledge stored in a careful and hard-won orchestration of weights. The phenomenon is akin to the idea of a deadly joke upon the hearing of which the human cognitive apparatus would be paralysed or destroyed. For a connectionist network, such a scenario is not altogether fanciful. As French comments: "Even when a network is nowhere near its theoretical storage capacity, learning a single new input can completely disrupt all of the previously learned information" (French (1991), p.4). The potential disruption is a direct result of the superpositional storage technique. It is thus of a piece with the capacity for 'free generalisation' which makes such nets attractive. (It is not caused by any saturation of the net's resources such that there is no room to store new knowledge without deleting the old.) Current networks are protected from the unlearning by a very artificial device, viz. the complete interweaving of the training set. The full set of training cases is cycled past the net again and again, so it is forced to find an orchestration of weights which can fit all the inputs. Thus in a corpus of three facts A, B and C, training will proceed by the successive repetition of the triple and not e.g. by training to success on A, then passing to B and finally to C. Yet this, on the face of it (but see below), is exactly what the phased training/phased memory solutions involve! The spectre of unlearning was directly and artificially controlled in the Norris experiment by stopping all learning in a successful sub-net. As Norris commented: "When subsequent stages start to learn they naturally begin by performing very badly. The learning algorithm responds by adjusting the weights in the network. ... If learning had been left enabled in the early nets then their weights would also have been changed and they would have unlearned their part of the task before the final stage had learned its part" (Norris (1991), p.295). What protects the Elman nets from this dire effect? The answer, in the case of phased training, seems to be that Elman protects the initial 'building block' knowledge by only allowing the complex cases in gradually, alongside some quite heavy duty reminders of the basics.
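The vulnerability being guarded against here is easy to reproduce in miniature. The toy sketch below (a single-layer net trained by the delta rule on two made-up, overlapping patterns; it is not any of the networks discussed in the text) contrasts purely sequential training, where mastering a similar new item B disrupts a previously learned item A, with the fully interleaved regime that current networks rely on.

# Toy demonstration of 'unlearning' in a superpositional store. Patterns,
# sizes and learning parameters are invented for illustration only.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two semantically similar items: overlapping input patterns, different targets.
A = (np.array([1.0, 1.0, 0.0, 0.0]), np.array([1.0, 0.0]))
B = (np.array([1.0, 1.0, 1.0, 0.0]), np.array([0.0, 1.0]))

def train(items, W=None, epochs=2000, lr=0.5):
    """Train a single-layer net on the given items by the delta rule."""
    W = rng.normal(0, 0.1, (2, 4)) if W is None else W.copy()
    for _ in range(epochs):
        for x, t in items:
            y = sigmoid(W @ x)
            W += lr * np.outer((t - y) * y * (1 - y), x)
    return W

def error_on(W, item):
    x, t = item
    return float(np.mean((t - sigmoid(W @ x)) ** 2))

W_seq = train([B], W=train([A]))   # train to success on A, then on B alone
W_mix = train([A, B])              # complete interweaving of the training set
print("error on A after sequential training :", round(error_on(W_seq, A), 3))
print("error on A after interleaved training:", round(error_on(W_mix, A), 3))

On a typical run, the sequentially trained weights should show a much larger error on A than the interleaved weights do; the delicate task, as the discussion above suggests, is to secure the benefits of ordered learning without paying this price.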
In the phased training regime, accordingly, the net at phase 2 sees a corpus comprising 25% complex sentences alongside 75% new simple sentences. Similarly, in the phased memory case, the net at phase 2 sees a random mix of 4 and 5 word fragments, thus gradually allowing in a few more complex cases alongside reminders of the basics. In short, the current vulnerability of nets to unlearning requires us to somehow insulate the vital representational products of early learning from de-stabilization by the net's own first attempts to deal with more complex cases. Such insulation does not seem altogether psychologically realistic, and marks at least one respect in which such networks may be even more sensitive to representational trajectory than their human counterparts.

The bigger picture, then, is a mixed bag of pros and cons. On the plus side, we have seen how the broad picture of the vital role of representational trajectories in connectionist learning makes unified sense of a superficially disparate set of developmental factors. On the minus side, we have seen that the phased memory solution is limited to domains in which merely statistically-driven undersampling is luckily equivalent to task decomposition, and that the endemic vulnerability of networks to unlearning makes the step-wise acquisition of domain knowledge an especially (and perhaps psychologically unrealistically) delicate operation.

4. Conclusions, and an Aside.

We began with an impossible question - "What comes naturally to a connectionist system?". Understood as a question about what kinds of theoretical space may be amenable to the connectionist treatment, this question was seen to be dangerously underspecified. The question becomes tractable only once a variety of parameters are fixed. These were seen to include obvious items such as the large scale configuration of the system (into sub-nets etc.), and also less obvious ones such as whether training is phased and whether the mature state is reached by a process of incremental 'growth'. The effects of these less obvious and superficially more peripheral factors were seen to be functionally equivalent to those involving the large scale configuration. The key to success, in all cases, was to somehow help the network to decompose a task into an ordered series of sub-tasks. In the absence of such decomposition, we saw that networks have a tendency to get 'lost in space(s)'. They try to account for all the regularities in the data at once; but some of the regularities involve others as 'building blocks'. The result is a kind of snow blindness in which the net cannot see the higher order regularities because it lacks the building blocks, but nor can it isolate these as it is constantly led off track by its doomed efforts to capture the higher-order regularities. The negotiation of complex theoretical spaces, then, is a delicate matter. Connectionist learning needs, in such cases, to be scaffolded. We saw (section 3) that the functional role of the kinds of scaffolding investigated by Elman and Newport (section 2) could be mimicked by a wide variety of superficially distinct developmental factors. I note as a final aside that this intimacy between connectionist learning and several much wider developmental factors begins to suggest a possible flaw in the so-called 'systematicity argument' presented in Fodor and Pylyshyn (1988).
That argument, recall, begins by defining a notion of systematic cognition such that a thinker counts as a systematic cognizer just in case her potential thoughts form a fully interanimated web. More precisely, a thinker is systematic if her potential thoughts form a kind of closed set, i.e. if, being capable of, say, the thoughts "A has property F" and "B has property G", she is also capable of having the thoughts "A has property G" and "B has property F". A similar closure of relational thoughts is required, as illustrated by the overused pair "John loves Mary" and "Mary loves John". The notion of systematicity, then, is really a notion of closure of a set of potential thoughts under processes of logical combination and re-combination of their component parts. There are many pressing issues here - not least the extent to which daily concepts and ideas such as "loves" and "John" can be properly supposed to isolate component parts of thoughts (see Clark (forthcoming) for a full discussion). For our purposes, however, it will be sufficient to highlight a much more basic defect. To do so, we must look at the argument in which the notion of systematicity operates. It goes like this:

1. Human thought is systematic.
2. Such systematicity comes naturally to systems using classical structured representations and logical processes of symbol manipulation.
3. It does not come naturally to systems using connectionist representations and vector to vector transformations.
4. Hence classicism offers a better model (at the cognitive psychological level) than connectionism.

The argument is, of course, only put forward as an inference to the best explanation, hence it would be unfair to demand that it be logically valid. But even as inference to the best explanation it is surely very shaky. For as we have seen, it is a mistake to suppose that the question of what kind of thing a connectionist network will learn is to be settled by reference to the generic form of the architecture and/or learning rules. Many other parameters (such as the system's development over time) may be equal determinants of the kind of knowledge it acquires. Even the observation of a pervasive feature of mature human cognition (e.g. systematicity) need not demand explanation in terms of the basic cognitive architecture. It could instead be a reliable effect of the regular combination of e.g. a basic connectionist architecture with one or more of a variety of specific developmental factors (including, surely, the effects of learning a systematic public language (see Dennett (1991) for an argument that language learning is the root of such systematicity as human thought actually displays)). The contrast that I want to highlight is thus the contrast between systematicity as something forced onto a creature by the basic form of its cognitive architecture versus systematicity as a feature of a domain or domains, i.e. as something to be learnt about by the creature as it tries to make sense of a body of training data. The suggestion (and it is no more than that) is thus that we might profitably view systematicity as a knowledge-driven achievement and hence as one in principle highly sensitive to the setting of all the various parameters to which connectionist learning has been shown (sections 1-3 above) to be sensitive. What I am considering is thus a kind of 'gestalt flip' in our thinking about systematicity.
Instead of viewing it as a property to be induced directly and inescapably by our choice of basic underlying architecture, why not try thinking of it as an aspect of the knowledge we (sometimes) want a network to acquire? In so doing we would be treating the space of systematically interanimated concepts as just another theoretical space, a space which may one day be learnt about by a (no doubt highly scaffolded) connectionist learning device. The mature knowledge of such a system will be expressible in terms of a (largely) systematically interwoven set of concepts. But the systematicity will be learnt as a feature of the meanings of the concepts involved. It will flow not from the shallow closure of a logical system under recombinative rules, but from hard-won knowledge of the nature of the domain. A "might" is, alas, well short of a proof. To make the case stick we would need to show in detail exactly how systematicity could arise as a function of a well-scaffolded trajectory of connectionist learning. What we can already say is just this: that given the demonstrable sensitivity of connectionist learning to the settings of an unexpectedly wide variety of developmental parameters, we should beware of arguments which, like Fodor and Pylyshyn's, marginalize the role of such factors and focus attention only on the combination of basic architecture and gross training inputs.

To sum up, I have tried to show (a) that connectionist learning is highly sensitive to differences in 'representational trajectory', i.e. to the temporal sequence of problem solutions, (b) that a surprisingly wide variety of factors (network growth, motor development, salience and selective attention etc.) may be understood as means of sculpting such trajectories, and finally (c) that the debate about what kinds of problem domain are amenable to a connectionist treatment is unreliable if pursued in a developmental vacuum, i.e. without reference to whatever mechanisms of sculpting nature or culture may provide. So much for the high ground. Failing that, and with (I'm told) a minimum of two new journals appearing every month, it should be a relief to learn that sometimes, at least, undersampling the corpus is the key to cognitive success.

Notes

1. Versions of this paper were presented to the 1991 Annual Conference of the British Psychological Society (Developmental Section) and the Perspectives on Mind Conference (Washington University in St. Louis, 1991). Thanks to the participants of those events for many useful and provocative comments and discussions. Thanks also to Margaret Boden, Christopher Thornton and all the members of the University of Sussex Cognitive Science Seminar, and to two anonymous referees whose comments and criticism proved invaluable in revising the original manuscript. Some of the material presented includes and expands upon Clark (forthcoming, ch.7). Thanks to MIT Press for permission to use it here.

Bibliography

Bechtel, W. and Abrahamsen, A. (1991) Connectionism and the Mind, Oxford: Basil Blackwell.
Clark, A. (1989) Microcognition: Philosophy, Cognitive Science and Parallel Distributed Processing, Cambridge, MA: MIT Press/Bradford Books.
Clark, A. (1991) "In defence of explicit rules," in Philosophy and Connectionist Theory, ed. W. Ramsey, S. Stich and D. Rumelhart, Hillsdale, NJ: Erlbaum.
Clark, A. and Karmiloff-Smith, A. (in press) "The Cognizer's Innards: A Psychological and Philosophical Perspective on the Development of Thought," Mind and Language.
Clark, A. (forthcoming) Associative Engines: Connectionism, Concepts and Representational Change, Cambridge, MA: MIT Press/Bradford Books.
Cooper, R. and Franks, B. (1991) "Interruptability: a new constraint on hybrid systems," AISB Quarterly (Newsletter of the Society for the Study of Artificial Intelligence and Simulation of Behaviour), Autumn/Winter 1991, no.78, pp.25-30.
Dennett, D. (1991) "Mother Nature versus the Walking Encyclopedia," in Philosophy and Connectionist Theory, ed. W. Ramsey, S. Stich and D. Rumelhart, pp.21-30, Hillsdale, NJ: Erlbaum.
Elman, J. (1991-a) "Incremental learning or the importance of starting small," Technical Report 9101, Center for Research in Language, University of California, San Diego.
Elman, J. (1991-b) "Distributed representations, simple recurrent networks and grammatical structure," Machine Learning, 7, pp.195-225.
Finch, S. and Chater, N. (1991) "A hybrid approach to the automatic learning of linguistic categories," AISB Quarterly, no.78, pp.16-24.
Fodor, J. and Pylyshyn, Z. (1988) "Connectionism and cognitive architecture: a critical analysis," Cognition, 28, pp.3-71.
French, R. M. (1991) "Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks," CRCC Technical Report 51, University of Indiana, Bloomington, Indiana 47408.
French, R. M. (1992) "Semi-distributed representations and catastrophic forgetting in connectionist networks," Connection Science, vol.4, nos. 3 and 4, pp.365-377.
Gonzalez, R. and Wintz, P. (1977) Digital Image Processing, Reading, MA: Addison-Wesley.
Hayes, P. (1985) "The second naive physics manifesto," in J. Hobbs and R. Moore, eds., Formal Theories of the Commonsense World, Norwood, NJ: Ablex.
Newport, E. (1988) "Constraints on learning and their role in language acquisition: studies of the acquisition of American Sign Language," Language Sciences, 10, pp.147-172.
Newport, E. (1990) "Maturational constraints on language learning," Cognitive Science, 14, pp.11-28.
Norris, D. (1990) "How to build a connectionist idiot (savant)," Cognition, 35, pp.277-291.
Norris, D. (1991) "The constraints on connectionism," The Psychologist, vol.4, no.7, pp.293-296.
Plunkett, K. and Sinha, C. (1991) "Connectionism and developmental theory," Psykologisk Skriftserie Aarhus, vol.16, no.1, pp.1-77.
Thornton, C. (1991) "Why connectionist learning algorithms need to be more creative," Conference Preprints for the First Symposium on Artificial Intelligence, Reasoning and Creativity (University of Queensland, August 1991). (Also CSRP 218, Cognitive Science Research Paper, University of Sussex.)