Multiplication Number Facts: Modeling Human Performance With Connectionist Networks

Abstract. Three connectionist models of human performance on simple multiplication number facts, commonly called "times tables," are reviewed. Also, human data from normal subjects and brain-damaged patients, which constrain these models, are presented. These human data include the problem size effect, error effects, priming effects, use of strategies and rules, and number representation. The connectionist models presented are: a simple auto-associator (J. A. Anderson's Brain-State-in-a-Box), a standard back-propagation model, and McCloskey and Lindemann's mathnet. The review of human data and connectionist models of memory retrieval provides some insight into the strengths of, differences between, and challenges for, this approach to computational modeling. Particular attention is paid to the representation of number used by these models, and a related ability to generalize learning.


Introduction
For many years experimental psychologists employed simple arithmetic problems solely as distractor tasks. However, with the reawakening of interest in cognition, researchers have begun to use basic number facts to probe the nature of memory representation and retrieval processes. Number facts, especially sums and products of single-digit numbers, have several advantages as experimental stimuli (e.g., practically everyone has learned addition and multiplication tables). The response time to answer various problems can be examined, and time differences provide important clues to possible mental organizations and processes. Also, even adults make occasional mistakes. Anderson (1995), commenting on how the human brain does arithmetic, uses the catchy phrase, "Not only is it slow, it is also inaccurate" (p. 586).
We present a review of three connectionist models designed to predict human performance on multiplication number facts: J. A. Anderson's Brain-State-in-a-Box (bsb), a standard back-propagation model, and McCloskey and Lindemann's mathnet. These models, by no means the only ones addressing this area of cognition, were selected to highlight some strengths of, and challenges for, the connectionist approach, and the important contribution of stimulus representation. Other well-developed models, such as Campbell's network-interference model (see Campbell & Oliphant, 1992), are not included because they have not been explicitly implemented in the connectionist framework at this time, and excellent reviews exist (e.g., Ashcraft, 1992; Graham & Campbell, 1992; McCloskey, Harley, & Sokol, 1991).
Prior to describing the connectionist models, it is necessary to identify the human subject phenomena that constrain their performance. The first section presents data on multiplication number fact performance across the life span, including single case studies of brain-damaged patients (see also Ashcraft, 1992; Dehaene, 1992; McCloskey, Harley, et al., 1991). In this brief review, we restrict our attention to data relevant to the models presented. The second section describes the models, including architecture, data representation, operation, and application to multiplication number facts.

Data from Human Subjects
Many studies have examined the number fact behavior of normal children and adults, and also the disruption of number fact performance in brain-damaged patients. The subjects being tested have typically been asked to perform two types of tasks, verification and production. In a verification task, subjects are shown a problem (e.g., 3 × 5 = 16) and are asked to respond "true" if they believe the answer to be correct, or "false" otherwise. In a production task, subjects are shown only the operands and asked to supply the answer (e.g., 5 × 4 = ?). For both types of tasks, response time and errors are recorded and analyzed as dependent variables. Several researchers have posited that verification consists of production followed by an additional stage of comparison to the given answer (e.g., Ashcraft, 1987; Parkman, 1972). However, others maintain an alternative view that the stated answer in a verification task alters the process of retrieval and therefore, possibly, the outcome (Zbrodoff & Logan, 1990).
Problem Size Effect

For most problems, response time and error rate increase with the size of the operands; this robust finding is termed the problem size effect. There are some exceptions to the problem size effect, for example, "ties" like 4 × 4 (Miller et al., 1984; Parkman, 1972). For these problems, response time for normal subjects is either constant, or increases only somewhat with larger operands. Although, in brain-damaged patients, impairment for multiplication facts with large operands is generally greater than that for problems with small operands (McCloskey, Harley, et al., 1991), impairment has been shown to be non-uniform (McCloskey, Caramazza, & Basili, 1985; Warrington, 1982).
There are two theories commonly cited as the cause of the problem size effect. The frequency theory (Ashcraft, 1992) proposes that, because small problems occur more often, frequency (i.e., practice) effects yield stronger memory traces. The order of presentation theory, based on the fact that small problems are typically learned first, posits that proactive inhibition impedes learning when larger facts are presented, giving rise to order effects (Campbell & Graham, 1985). A recent study of adults, showing that practice attenuates, but does not eliminate, the problem size effect, is more consistent with the frequency theory than with the order of presentation theory (Fendrich, Healy, & Bourne, 1993).

Error Effects
The errors made by human subjects on multiplication number facts can be classified into several types, which are described below using the terminology of McCloskey, Harley, et al. (1991).

Operand Errors
The most common error is termed an operand error or an operand-related error (Ashcraft, 1992). In this type of error the incorrect answer given is correct for another problem that shares an operand (e.g., 8 × 7 = 40 is given, and 40 is the correct answer to 8 × 5). Campbell and Graham (1985) found that this type of error accounted for over 79% of the errors made by subjects. Also, 85% of the errors made by brain-damaged patient PS were of this type (Sokol, McCloskey, Cohen, & Aliminosa, 1991).
When an operand error is made, the incorrect answer not only shares one operand, but usually, the other operand is off by only 1 or 2 (e.g., 9 × 7 = 72). This phenomenon, termed the operand distance effect, was present in 95% of the operand errors made by patient PS and 60% of those made by patient GE (Sokol et al., 1991).

Table Errors
In a table error, the answer does not share an operand with a correct answer, but the answer given does reside in the multiplication table (e.g., 6 × 9 = 56, when 56 is the correct response to 7 × 8). Campbell and Graham (1985) found that this type of error accounted for 13% of the multiplication errors made by adults.

Operation Errors

Operation errors, also termed cross-operation confusions (Ashcraft, 1992), occur when the answer given is correct for a problem having the same two operands, but a different operation (e.g., 9 × 8 = 17). These cross-operation confusions increase response time in production (e.g., Campbell, 1987a; Campbell & Clark, 1989; Campbell & Graham, 1985; Miller et al., 1984) and verification tasks (e.g., Ashcraft & Battaglia, 1978; Winkelman & Schmidt, 1974; Zbrodoff & Logan, 1986). In one study, this mistake accounted for 24% of the errors made by normal adults (Miller et al., 1984). This effect caused 69% of the errors of brain-damaged patient GE (Sokol et al., 1991).

Non-table Errors
A non-table error occurs when the answer given is not an answer to any problem in the multiplication tables (e.g., 4 × 9 = 38); this type of error is not frequently committed by normal adults. Campbell and Graham (1985) found that it accounted for only 7% of all errors made by subjects. This error is also rarely committed by brain-damaged patients (Sokol et al., 1991).

Priming Effects
Multiplication problem response time and accuracy can be affected by priming. Correct responses to recently practiced arithmetic facts are given more quickly than other correct responses (Campbell, 1987b; Stazyk et al., 1982). Erroneous responses have also been found to relate to both positive and negative priming (Campbell & Clark, 1989). Positive error priming is shown when previous answers (usually given two or three trials before) occur as errors with a probability higher than chance. Negative error priming is shown when a particular incorrect response is less likely to occur if it is the answer given for an immediately preceding trial. Errors of operand intrusion, when an operand is erroneously repeated in an answer (e.g., 4 × 8 = 28), may also arise from priming (Campbell & Clark, 1992).

Strategies and Rules
Multiplication, unlike addition, is not easily accomplished by a counting approach, but other rules and strategies can facilitate production or verification. The generalized rule of commutativity greatly reduces the number of distinct problems to be memorized: If the answer to 8 × 6 is forgotten, the answer to 6 × 8 can be used.
There are several ways to verify the plausibility of a stated answer; for example, when one of the operands is 5, the product always ends in 0 or 5.
Generalizations for multiplication by 0 and 1 are perhaps the most frequently applied (i.e., 0 × N = 0 and 1 × N = N). Problems including the operands 0 or 1 have been shown to exhibit patterns of response time and error effects different from other problems (Aiken & Williams, 1973; Ashcraft, 1982, 1992; Parkman, 1972; Stazyk et al., 1982). Neuropsychological data lend support to the theory that multiplication by zero is stored as a rule. PS, a brain-damaged patient, performed quite differently on zero problems than on non-zero problems (McCloskey, Aliminosa, & Sokol, 1991; Sokol et al., 1991).

Representation of Number
The mental representation of numerical quantities may provide bases for the effects described above. Moyer and Landauer (1967) first suggested that an analog magnitude representation is included in the concept of number. The symbolic distance effect, often called the split effect, demonstrated by a decrease in time to choose the larger of a pair of digits as the difference between them increases, was interpreted to imply magnitude representation. Moyer and Landauer observed the similarity between the results from numerical comparison tasks and the time for discriminations along perceptual continua (e.g., brightness or weight) as characterized by Weber-Fechner laws. This observation led them to suggest that numbers are internally represented as magnitude analogs that are, approximately, a logarithmic function of digit size. Further studies of normal and brain-damaged subjects support a duality of number representation: analog and digital. Magnitude is commonly considered to be analog in nature, and compressive as numbers increase in size (Banks & Hill, 1974; Dehaene & Cohen, 1991; Michie, 1985; Todd, Barber, & Jones, 1987; Warrington, 1982). This number scale compression could make larger numbers less discriminable, and thus contribute to the problem size effect.
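The consequence of a compressive scale can be sketched numerically. The minimal illustration below is our own, not drawn from any of the models reviewed: under a logarithmic mapping, adjacent large numbers lie closer together internally than adjacent small numbers, and are therefore harder to discriminate.

```python
import math

def internal_distance(a, b):
    """Separation of two numbers on a logarithmically compressed internal scale."""
    return abs(math.log(a) - math.log(b))

# Adjacent small numbers are farther apart internally than adjacent large ones:
small = internal_distance(2, 3)   # log(3/2), about 0.41
large = internal_distance(8, 9)   # log(9/8), about 0.12
```

On such a scale, 8 and 9 are less than a third as discriminable as 2 and 3, which is one route by which compression could produce a problem size effect.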
Other investigations of the number concept have uncovered further complexity. The application of multidimensional scaling to data from number similarity judgments has yielded dimensions of "magnitude," "odd versus even parity," and "prime versus composite numbers" (Shepard, Kilpatrick, & Cunningham, 1975), and has also shown native language influences (Miller, 1992). Campbell (1994) found interaction effects between two number formats (Arabic digit and English number-word) and problem size effects, operation errors, and operand-intrusion errors. Inter-trial error priming effects were also differentially affected by number format. These results were interpreted as suggesting "notation-dependent" activation of number facts, and "interpenetration of number reading and number-fact retrieval processes."

Connectionist Models of Multiplication Number Fact Retrieval
All three models reviewed in this section adhere to the connectionist tenet of distributed representation and employ a nonlinear retrieval mechanism. However, they vary in motivation, scope, and emphasis. Anderson's bsb applies the simplest of structures and algorithms to learning number facts, thus showing the power of neural networks to learn, and generalize, through massive parallelism. The back-propagation model of McCloskey and Cohen (1989) serves mainly to highlight a learning problem of some models. McCloskey and Lindemann's mathnet uses a probabilistic approach to simulate number fact retrieval within the framework of a broader modular model of numerical processing.
We will pay particular attention to the number representation of the models. If the data representation is not appropriate, a network does not learn well, even with the most powerful learning rules. Determining a scheme to transform a problem such as 4 × 7 = 28 into a model representation is not trivial. Numbers perceived as being similar should be represented so as to be similar in structure, and numbers perceived as different should be dissimilar in structure. If this is not the case, the mapping to be learned by the network becomes quite complicated. Moreover, the structure of the pattern (typically expressed as a vector) should reflect some psychological validity.
The Anderson Brain-State-in-a-Box Model

J. A. Anderson and his associates (Anderson, 1992; Anderson, Spoehr, & Bennett, 1994; Viscuso, 1989; Viscuso, Anderson, & Spoehr, 1989) have employed a simple neural network model (i.e., the bsb) to generate many of the basic number fact effects of human arithmetic learning. The model represents data with large high-dimensionality vectors, which allow for manipulating the amount of correlation between stimuli, therefore making it possible to use simple learning and retrieval algorithms.
The bsb couples a classic associative memory with nonlinear retrieval dynamics. The associative memory links a set of "neuron-like" units to itself, and is therefore termed an auto-associator. Such systems are often used in pattern recognition applications, because associating a pattern to itself allows the regeneration of a complete pattern response as output when only a partial or a degraded copy of a stored pattern is input. This property of the auto-associator can be applied to multiplication number facts by first training the network to associate the pattern representing a problem (e.g., 6 × 7 = 42) to itself in a learning phase. In a later test phase, a partial stimulus (e.g., 6 × 7 =) can be presented as input and the network will complete the pattern, thus supplying the answer.

Architecture of the Model
The architecture of the bsb consists of one layer of units that connect to themselves, as illustrated in Figure 1. The connection weights between units are bidirectional and symmetric. The units may be fully connected, as illustrated in the figure, or only partially connected by randomly setting some of the weights to 0. Anderson and his colleagues have frequently used 50%, or less, connectivity. Partial connectivity does not qualitatively affect the network performance, but reduces computational time, and provides some increase in biological realism (Anderson, 1995).

Figure 1. The architecture of an auto-associator comprised of six units. Each unit is connected to all other units, providing feedback when input is presented.

Multiplication Problem Representation
The problem representation, called "state vector coding," is a hybrid scheme first described by Viscuso (1989) and Viscuso et al. (1989) that includes different facets of number representation. The general layout of a state vector, which represents a multiplication number fact, is illustrated in Figure 2. Each operand and the product are constructed of two parts, one part being an arbitrary abstract symbolic representation and the other part being a sensory representation that is roughly analog. This analog representation is, according to Anderson (1992, 1995), responsible for much of the bsb performance.
The analog portion of the bsb state vector, provided for each operand and the product, reflects the magnitude of the number with a bar consisting of a series of +1's, placed within a field of −1's. The position of the bar shifts from left to right as the number to be represented increases in magnitude. The bars are positioned within the magnitude field in a staggered fashion so that those representing numbers of similar magnitude will partially overlap. The width of the bars can be adjusted to represent the correlation between nearby numbers (Anderson, 1995).
The symbolic portion of the representation maps the spelling of a number word (e.g., four, twenty) into −1's and +1's. This is accomplished by first converting each letter into its numeric value in ascii code. For example, as illustrated in the top of Figure 3, the letter 's' becomes 115 in decimal notation. The ascii value is then expressed in binary as a string of seven 1's and 0's, a parity bit is added to the front of the string, and finally all 0's are changed to −1's. These eight-element patterns for each letter are concatenated following the spelling order of the number word. This process is detailed at the bottom of Figure 3 for the symbolic representation of the word "one." This particular representation is, in fact, arbitrary: Other symbolic coding schemes could be used to achieve the same effect. The Anderson symbolic representation has the advantage of being easily decoded on the output side, in a reverse fashion, affording human interpretation of the network response to a problem. Activation values exceeding a threshold in absolute value are set to the closest limit, either +1 or −1. This is done only in the final decoding of the network response and serves to eliminate inconsistent and noisy answers. The −1's are then changed back to 0's, and the rightmost 7 binary digits for each letter are converted to the equivalent ascii letter (or other symbol). Thus, the bsb output is converted into a "spelled-out" response.
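This letter coding and its inverse can be sketched as follows. This is a minimal reconstruction from the description above; the function names are ours, and we assume an even parity bit (i.e., one chosen so that the eight bits contain an even number of 1's):

```python
def encode_letter(ch):
    """7-bit ascii code for a letter, prefixed with an even parity bit,
    with every 0 mapped to -1."""
    bits = [int(b) for b in format(ord(ch), '07b')]
    parity = sum(bits) % 2                 # makes the total count of 1's even
    return [1 if b else -1 for b in [parity] + bits]

def encode_word(word):
    """Concatenate the eight-element letter codes in spelling order."""
    return [v for ch in word for v in encode_letter(ch)]

def decode_word(vec):
    """Inverse mapping: threshold to +/-1, drop each parity bit, and convert
    the remaining 7 bits of each letter back to the equivalent ascii letter."""
    letters = []
    for i in range(0, len(vec), 8):
        bits = ''.join('1' if v > 0 else '0' for v in vec[i + 1:i + 8])
        letters.append(chr(int(bits, 2)))
    return ''.join(letters)
```

For example, encode_word("one") yields a 24-element ±1 vector that decode_word maps back to "one".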

Operation of the Model
The model contains two phases: a learning phase and a testing (i.e., retrieval) phase.
Learning phase. First, the learning phase establishes the weights for each connection between units of the auto-associative memory. Using either standard Hebbian or Widrow-Hoff techniques, each representation of a multiplication number fact is associated with itself (see Anderson, Silverstein, Ritz, & Jones, 1977; Abdi, 1994a for details).
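As a sketch of the Widrow-Hoff (delta rule) variant, auto-associative learning repeatedly nudges the weight matrix so that each stored pattern reproduces itself. This is a minimal illustration only; the learning rate and epoch count are arbitrary choices of ours:

```python
import numpy as np

def train_autoassociator(patterns, lr=0.01, epochs=100):
    """Widrow-Hoff learning for an auto-associator: W is adjusted so that
    W @ f approximates f itself for every stored pattern f."""
    n = patterns[0].size
    W = np.zeros((n, n))
    for _ in range(epochs):
        for f in patterns:
            error = f - W @ f              # difference between target and recall
            W += lr * np.outer(error, f)   # delta-rule weight update
    return W
```

With Hebbian learning the update would simply be `W += lr * np.outer(f, f)`; the Widrow-Hoff rule instead learns from the recall error, which lets later presentations correct interference from earlier ones.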
Retrieval phase. After the learning phase has completed the creation of the connection matrix, the testing phase is initiated. In this phase, an incomplete or degraded version of a learned pattern, or a novel (i.e., never learned) pattern, is fed to the network by setting each unit to one element of the pattern. Each unit calculates a response by combining its initial activation with the feedback arising from "filtering" the activation of all connected units through the weighted connections (see Figure 4). The units compute the sum of the products of all inputs times their respective connection weights, and adjust their current activation level by incorporating this sum. Both the unit activation, and the feedback, may be scaled by a decay constant. Optionally, a multiple of the initial stimulus is also incorporated (termed clamping of the input). The entire summation process (shown by the arrows in Figure 4) constitutes one cycle (i.e., time step) of retrieval. This feedback cycle is repeated, and the final state represented by the activation values for all units is taken as the response of the model.
A limit function prevents unit activation from growing boundlessly. If unit activation exceeds an upper limit (e.g., +1), or falls below a lower limit (e.g., −1), it is set to that limit. This clipping forces the network state to stay within a hypercube, hence the "in-a-Box" part of the name. When all units reach their limit, a stable state has been attained.
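Putting the feedback cycle and the limit function together, one retrieval run can be sketched as below. This is our own minimal rendering of the dynamics just described; the parameter values are illustrative, not Anderson's:

```python
import numpy as np

def bsb_retrieve(W, x0, alpha=0.2, gamma=0.9, delta=0.1, max_cycles=500):
    """Iterate x <- clip(gamma*x + alpha*W@x + delta*x0, -1, +1) until every
    unit saturates at a limit. gamma scales the current activation (decay),
    alpha scales the feedback, and delta clamps in the initial stimulus.
    Returns the final state and the cycle count, a proxy for response time."""
    x = x0.copy()
    for cycle in range(1, max_cycles + 1):
        x = np.clip(gamma * x + alpha * (W @ x) + delta * x0, -1.0, 1.0)
        if np.all(np.abs(x) == 1.0):       # every unit at a corner of the box
            return x, cycle
    return x, max_cycles
```

Accuracy can then be scored as the correlation or cosine between the final state and the taught pattern, and response time read off as the returned cycle count.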
Under certain conditions, with repeated feedback cycles, the activation of each unit is drawn toward the limits of the clipping function, so that the state approaches a vertex of the hypercube (see Golden, 1986). Therefore, the network response can be assessed in terms of both speed (i.e., response time) and accuracy. Response time can be measured by the number of feedback cycles needed to reach stability, and accuracy by the similarity of the output to the correct answer (e.g., by taking the correlation or the cosine between the correct or "taught" vector pattern and the reconstructed vector). The bsb function is described in more mathematical detail in Appendix A.

Application to Multiplication Number Fact Recall

Viscuso (1989) and Viscuso et al. (1989) applied the bsb model to the production and verification of qualitative multiplication, for which an approximation, or estimation, of the answer is given (e.g., 2 × 6 ≈ 10 and 8 × 6 ≈ 50). Each problem of the qualitative multiplication table was constructed using a 640-element hybrid representation, part magnitude and part symbolic, as previously described. Results were correct for only somewhat more than 50% of the problems, but comparable to human performance for associative interference, practice effects, and the symbolic distance effect. The model did well on zeros problems, appearing to extract this "rule." A confusion error matrix showed that, in general, wrong answers clustered around the correct magnitude. Moreover, errors were of an associative nature, comparable to those of human performance as observed by Campbell and Graham (1985) and Norem and Knight (1930). Both human subjects and the model provided large products for problems containing a nine as an operand, and frequently confused problems with sixes and sevens as operands.
Frequency effects. To simulate the influence of practice effects (i.e., frequency of exposure to certain problems) on the model, Viscuso (1989) and Viscuso et al. (1989) biased the connection matrix to give more importance to particular problems. This was accomplished by modifying the Widrow-Hoff learning rule for particular problems by multiplying their representations by a scaling constant set to 1.5 (i.e., using vectors of values +1.5 and −1.5 instead of +1 and −1 in the learning phase). Problems having the largest products, specifically those in the 6 to 9 times tables, were treated this way. In the retrieval phase, wrong answers to other problems reflected this bias and were larger in magnitude than they were without this treatment (e.g., 2 × 3 = 20). These results are similar to a practice effect shown with human subjects: Answers to extensively practiced problems are more likely to be given as incorrect answers to other problems (Campbell, 1987b; Norem & Knight, 1930).
Order effects. In all the simulations described above, stimuli were presented repeatedly in the learning phase, but always in randomly mixed order. This is not the way most children are presented with number facts. Addition is usually presented prior to multiplication, and multiplication tables are typically presented and practiced in order from small to large. An attempt to simulate order effects with the bsb illustrates an interesting problem. Viscuso (1989) notes that when the 30 learning presentations for one problem were presented sequentially to the Widrow-Hoff learning algorithm, a very strong "recency interference" effect arose (an effect now often termed catastrophic interference). For example, if the last problem learned was 9 × 9 = 80, the network tended to give the answer 80 to many other problems. The tendency for the sequentially trained bsb to corrupt previously learned material with new material is much stronger than the sequential and positive error priming effects observed in human subjects.
Other effects. In more recent simulations (Anderson et al., 1994) the bsb performance improves to 70% average correctness, due to changes in the model representation, and a slight modification of operation. The number of elements in the problem representation is increased to 1266, with 422 elements allotted to each operand and the product. The symbolic part for each number is 72 elements, leaving 350 elements for the analog portion, which contains a 78-element bar. The bar consists of mostly +1's (about 13 −1's are randomly placed to provide some noise) and is positioned in a field of all zeros. Bars representing numbers of similar magnitude overlap in a compressed manner, roughly logarithmic, as magnitude increases. During the retrieval phase, the bsb algorithm is applied as before, except that the operands of the problem are always clamped, and that an activation threshold is used to keep the zero parts of the vector at zero. Anderson et al. (1994) successfully simulated the effects of symbolic distance and priming, and with somewhat less success the generalization of learning. Priming, implemented by increasing connection weights for certain problems previously answered correctly, subsequently decreased the number of iterations (i.e., response time) needed to answer these problems. These priming effects generalized, to a lesser degree, to related problems sharing operands or answers. To test its ability to generalize to the "rule" of commutativity, the network was trained with a problem in one order (e.g., 7 × 9) and then tested with the opposite order (e.g., 9 × 7). The network response to the new order was correct 50% of the time, and always twice as slow as the response to the learned problem. Thus, although generalization was accomplished to some degree, it was far from perfect. However, when a problem was omitted from the training set and this "novel" problem was tested, the response was typically correct, and required only a few more cycles than the responses to the learned problems.

Discussion
The strength of the bsb model is to show that a simple and understandable neural network can simulate number fact effects for multiplication. Specifically, the model simulates the human phenomena of associative interference and problem size in production, the split effect for verification, priming effects, and some ability to generalize learning. The representation used for the problems, with its emphasis on magnitude relationships, is the key to many of these effects, and suggests an explanation for the problem size effect other than order of learning and frequency. However, while not discounting the achievements of the model, it may be instructive to consider its weaknesses.
The problem size effect, which is extremely robust in human subjects, is not very robust in the bsb model (Anderson et al., 1994). The effect was best simulated when the bar coding was compressed logarithmically. Although frequency or practice effects were generally simulated successfully, a simulation of order effects produced an overwhelming recency effect. The question is: Is this problem unique to the bsb or is it characteristic of other neural network models? McCloskey and Cohen (1989) further investigate the problem of what they term "catastrophic interference" when serial learning is simulated by a three-layer back-propagation neural network.

The McCloskey and Cohen Back-propagation Model of Arithmetic Learning
To investigate the effects of sequential learning on a back-propagation neural network, McCloskey and Cohen (1989) modeled the learning of number facts. What is most interesting about this model is not its successful predictions, but its failure to simulate the human ability to sequentially acquire and retain information. Although newly learned associations may degrade performance on previously learned material (i.e., retroactive interference), in humans the decline is not "catastrophic." In this model, sequential presentation of stimuli during learning results in network "retrograde amnesia."

Architecture of the Model
The standard back-propagation model used by McCloskey and Cohen (1989) contains three layers consisting of 28 input units, 50 hidden units, and 24 output units (see Abdi, 1994b for more details about this type of architecture).

Multiplication Problem Representation
McCloskey and Cohen (1989) used a "coarse-coded" representation for numbers. As illustrated in Figure 5, each number between 0 and 9 inclusive is represented by 12 units. In a manner similar to that used by the analog portion of the bsb representation, three consecutive units are set to +1, with the remaining nine having a value of 0. As the size of a number increases, the bar of +1's moves to the right. The input stimulus vector is constructed by concatenating these number representations, one for each operand, with a four-unit operation representation.
The correct response for an output vector is formed by the concatenation of two number representations, one for the tens digit of the answer and the other for the ones digit.
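This coarse coding can be sketched as follows. It is a minimal reconstruction: we assume the bar for digit d occupies units d through d+2, which is consistent with the unit counts given but is otherwise our own choice, as is the particular operation code passed in:

```python
def coarse_code(digit):
    """12-unit bar code for a digit 0-9: three consecutive +1's whose
    position shifts rightward as the digit grows; adjacent digits overlap."""
    units = [0] * 12
    units[digit:digit + 3] = [1, 1, 1]
    return units

def encode_problem(a, b, op_units):
    """28-unit input vector: two operand bar codes plus a 4-unit operation code."""
    return coarse_code(a) + coarse_code(b) + op_units

def encode_answer(answer):
    """24-unit output vector: bar codes for the tens and ones digits."""
    return coarse_code(answer // 10) + coarse_code(answer % 10)
```

Because neighboring digits share two of their three active units, the coding builds magnitude similarity into the stimuli, much as the bsb bar coding does.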

Operation of the Model
This standard three layer back-propagation model has a learning phase and a retrieval phase.After the learning phase has established weights for the connections, retrieval is tested.The pattern of activity levels of the output units is interpreted as the answer to the problem.

Application to Multiplication Number Fact Recall
When the network was trained concurrently on the 200 single-digit addition and multiplication number facts, recall was virtually perfect. Such high performance is not unusual for back-propagation models. Next, McCloskey and Cohen (1989) investigated the effects of sequential training and compared their model predictions to the results obtained in a classic learning study that showed a clear effect of retroactive interference. In this study by Barnes and Underwood (1959), subjects learned lists of eight paired nonsense syllables and adjectives using an A-B/A-C paradigm. After perfectly learning a first list pairing nonsense syllable A with adjective B, they were presented with a second learning list, which paired the same nonsense syllable A with a different adjective C. Subjects were tested after various numbers of practice trials (1, 5, 10, or 20) on the second list and asked to recall both the B and the C response when shown A. Figure 6 illustrates how recall of the two lists changed as a function of practice trials on the second list. As performance on the second list improved, performance on the associations of the first list steadily declined to about 50 percent recall. However, note that although the subjects demonstrated some forgetting of the first list, they did retain some information.
McCloskey and Cohen (1989) first trained the network to respond correctly to 17 ones addition facts (i.e., 1 + 1 through 9 + 1, and 1 + 2 through 1 + 9). Training on the twos addition facts (i.e., 2 + 1 through 2 + 9, and 1 + 2 through 9 + 2) was then initiated. Recall of both sets of number facts was tested after each learning iteration for the twos facts. A result of this testing is illustrated in Figure 7, showing rapid and almost complete "forgetting" of the ones facts, with performance declining from 100% to 57% after only one learning trial.
McCloskey and Cohen (1989) report on a series of experimental manipulations of the network parameters in an attempt to isolate the cause of this problem: changing the number of hidden units, changing the learning rate parameter, overtraining on the first list, freezing some weights, changing target activation values, and changing the representation of the stimuli. None of these manipulations caused the results to approach human performance. McCloskey and Cohen, expressing a somewhat pessimistic view of the capability of neural networks to model human cognition, conclude that networks of this type cannot handle a sequential training regimen, because new learning will always modify the weight configuration and thus change the solution space. This modified space may no longer be compatible with previously learned material. These results, raising some concern about the wisdom of basing models of human learning performance on this type of architecture, have led to further investigation of this phenomenon.

Further Investigation of the Model
One of the drawbacks of a layered connectionist model, such as that of McCloskey and Cohen (1989), is that it is not easy to attribute specific results to particular characteristics of the network. However, Lewandowsky (1991, 1994) has undertaken such a task to identify the cause(s) of, and possible remedies for, catastrophic interference in this type of model. His results provide a somewhat more optimistic outlook for the cognitive modeling capabilities of neural networks. Lewandowsky (1991), concentrating primarily on the abrupt steepness of unlearning in sequentially trained distributed models, shows the main cause to be the nature of the representation used. By creating stimuli as random-valued zero-centered vectors (i.e., the expected value of the correlation between vectors is 0) and applying a continuous retrieval measure (e.g., cosine), he produced a more gradual unlearning using the same network as McCloskey and Cohen (1989). However, another problem still remains. Even though unlearning is no longer as rapid, it continues to be practically complete when mastery of the second list is achieved. Also, using random vectors fails to capture an important feature of arithmetic fact stimuli: the facts are naturally correlated, giving rise to confusion effects (e.g., consider 4 × 8 = 32 and 4 × 6 = 24, which have one operand in common). Therefore, another approach is needed to eliminate the sequential learning problem.
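Lewandowsky's representational point can be made concrete with a small sketch. The dimensions and the operand coding below are illustrative assumptions, not his actual stimuli: zero-centered random vectors have near-zero expected pairwise cosines, whereas arithmetic-fact codes overlap whenever two problems share an operand:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # the continuous retrieval measure mentioned in the text
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-centered random stimulus vectors: expected pairwise correlation is 0.
dim = 100
vecs = rng.uniform(-1.0, 1.0, size=(50, dim))
pairwise = [cosine(vecs[i], vecs[j])
            for i in range(50) for j in range(i + 1, 50)]
print(f"mean cosine of random zero-centered pairs: {np.mean(pairwise):+.3f}")

# Arithmetic facts, by contrast, are naturally correlated: problems sharing
# an operand (e.g., 4 x 8 and 4 x 6) share part of their input code.
def one_hot(n, size=10):
    v = np.zeros(size)
    v[n] = 1.0
    return v

p48 = np.concatenate([one_hot(4), one_hot(8)])   # 4 x 8
p46 = np.concatenate([one_hot(4), one_hot(6)])   # 4 x 6
print(f"cosine of the 4x8 and 4x6 input codes: {cosine(p48, p46):.2f}")
```

The shared operand alone gives the two problem codes a substantial cosine similarity, which random vectors cannot reproduce.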
In a more recent article, Lewandowsky (1994) has reported on possible alternative solutions. First he notes that the ability to generalize, as well as the tendency for catastrophic interference, arise from a distributed representation. Thus, solutions for catastrophic interference that create less than fully distributed representations result in a decline of generalization ability. In contrast, solutions that modify the learning rule can reduce the overlap between internal representations (i.e., those created by the hidden units) and also maintain generalizability. One such solution, the novelty rule, uses only the differences of a new stimulus from previously learned material to update connection weights. A drawback to this approach, however, is that it is only applicable when the input and the output are the same, as in an auto-associator. Another solution, termed activation sharpening, reduces the overlap of (i.e., de-correlates) internal representations by raising the activity level of the hidden units that are already the most active, and decreasing the activity level of all the other hidden units (French, 1992). This solution can be applied even when the input and the desired output differ.
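A minimal sketch of the activation-sharpening step is given below. The number of boosted units `k` and the sharpening factor `alpha` are illustrative choices; French's (1992) algorithm differs in detail:

```python
import numpy as np

def sharpen(hidden, k=2, alpha=0.2):
    """One sharpening step (after French, 1992): nudge the k most active
    hidden units toward 1 and damp all other units toward 0, reducing the
    overlap between internal representations. k and alpha are illustrative
    values, not taken from the original paper."""
    sharpened = hidden * (1.0 - alpha)            # damp every unit...
    top = np.argsort(hidden)[-k:]                 # ...except the k most active,
    sharpened[top] = hidden[top] + alpha * (1.0 - hidden[top])  # which are boosted
    return sharpened

h = np.array([0.9, 0.1, 0.6, 0.4, 0.2])
print(sharpen(h))   # the two most active units rise; the rest fall
```

Repeated over training, this pushes the hidden-unit patterns for different stimuli apart, which is the de-correlation that relieves interference while leaving the representation distributed enough to generalize.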

Discussion
To conclude, in addition to the severe interference difficulties encountered in this standard back-propagation model when sequentially trained, there are other drawbacks. First, because of the back-propagation algorithm, the model must be given the answer in order to learn. Second, learning itself is quite computationally intensive and can require thousands of iterations.
A further problem with this model is its failure to replicate another aspect of human behavior. Like the bsb, this model is deterministic (for a given stimulus and set of connections, the model always produces the same output). In contrast to human subjects, the model will never make an occasional mistake. Educated adults do make occasional mistakes when tested on basic number facts (e.g., under speeded testing, adults make errors in their responses to single-digit multiplication problems about 7.7% of the time; Campbell & Graham, 1985). The final model to be reviewed, mathnet by McCloskey and Lindemann (1992), simulates this human behavior.

The McCloskey and Lindemann mathnet Model
mathnet is embedded within a general model of numerical processing (see Figure 8). This general model, first proposed by McCloskey et al. (1985), arises from dissociations observed in the numerical processing of brain-damaged patients (e.g., McCloskey, Aliminosa, et al., 1991; McCloskey et al., 1985; Warrington, 1982). mathnet implements the arithmetic fact retrieval component of this general model.
The implementation of mathnet, although in some ways similar to that of the McCloskey and Cohen (1989) model, contains some interesting differences, particularly in the areas of the retrieval process and the training regimen that is used. The architecture of the mathnet model is also somewhat more complex.

Architecture of the Model
Like the McCloskey and Cohen (1989) model, mathnet has a three-layer structure consisting of 26 input units, each connected to all of the 40 hidden units, which are in turn connected to all of the 24 answer units. However, for mathnet, unlike back-propagation models, the answer units are also interconnected. All the connections are bidirectional and symmetric. Also, each hidden and answer unit is provided with a bias (i.e., a tendency toward a positive or negative activation level).

Multiplication Problem Representation
The mathnet representation for multiplication number facts is like that used by McCloskey and Cohen (1989), as illustrated in Figure 5, with two exceptions: 1) activation levels of −1 and +1 are used instead of 0 and +1, and 2) the operation is coded by only two units (set to −1 and +1 for multiplication) instead of four units.
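A sketch of this input encoding is given below. The window-of-three coarse code is a hypothetical stand-in (the exact 12-element code of Figure 5 is not reproduced here); the ±1 activations, the 12-units-per-operand layout, and the two-unit operation code follow the text:

```python
import numpy as np

N_DIGIT = 12   # coarse-coded units per operand, as in McCloskey and Cohen's Figure 5

def coarse_code(n):
    """Hypothetical coarse code: each number activates a window of three
    adjacent units, so neighboring numbers share active units. The actual
    code in Figure 5 differs in detail; this is only illustrative."""
    v = -np.ones(N_DIGIT)          # mathnet uses -1/+1 rather than 0/+1
    v[n:n + 3] = +1.0
    return v

def encode_problem(a, b):
    op = np.array([-1.0, +1.0])    # two units code the operation (multiplication)
    return np.concatenate([coarse_code(a), coarse_code(b), op])

x = encode_problem(4, 8)
print(x.shape)   # 12 + 12 + 2 = 26, matching the size of the mathnet input layer
```

Note that 12 + 12 + 2 recovers the 26 input units given for the architecture above.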

Operation of the Model
As described so far, this model does not appear to be extremely different from that of McCloskey and Cohen (1989). It is in the retrieval and learning processes that a major difference appears. Specifically, in contrast to the back-propagation model, which feeds forward to calculate concurrently the responses (activation levels) of all units in a layer on each iteration, mathnet employs an asynchronous technique, calculating the response of each unit in turn, in random order. During this activation update the bidirectional connections enable the answer units to affect the hidden units, as well as themselves.
Retrieval process.
To retrieve an answer to a multiplication problem, the network is first initialized by setting the activation levels of the problem units to the problem representation and those of the hidden and answer units to 0. During the retrieval process the activation of the hidden and answer units is modified, but the problem units are clamped (i.e., they remain unchanged and always supply input). Simulated annealing is then used. This term is adapted from a process of heating and gradually cooling materials, such as glass or metals, to free them from internal stress. For mathnet, this process iteratively calculates activation values for each unit that is free to vary. On each iteration, the activation values of all the hidden and answer units are updated one at a time in a random order. This asynchronous updating of unit activation gives mathnet the ability to arrive at different answers for the same problem, because the activation value attained by a particular unit depends upon which other units have been updated previously. When the annealing schedule is complete, the problem answer is determined by scoring the tens-digit and ones-digit responses of the answer units separately, taking the best match to the representations used for the digits 0 to 9.
Figure 9 (see also Appendix B) presents an alternate view of the specific mathnet architecture. This figure illustrates the three sets of layer connections as matrices: W is a 26 × 40 matrix connecting each problem unit to each hidden unit, Z is a 24 × 40 matrix connecting each hidden unit to each answer unit, and A is a 24 × 24 matrix connecting the answer units to themselves. Recall that in the mathnet architecture the connections are bidirectional.
To provide a specific example, Figure 9 illustrates the activation update of one particular hidden unit, hidden unit 4. Input to the activation update comes from three sources: 1) the current activation level of each problem unit multiplied by its connection weight to hidden unit 4, 2) the current activation level of each answer unit multiplied by its connection weight to hidden unit 4, and 3) a bias weight.
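This update can be sketched as follows. The weights are random stand-ins with the dimensions given in the text, and the tanh mean-field activation and temperature parameter are assumptions (the exact update function is specified in Appendix B):

```python
import numpy as np

rng = np.random.default_rng(1)

# Weight matrices with the dimensions given in the text; the values here
# are random stand-ins, not trained mathnet weights.
W = rng.normal(size=(26, 40))    # problem units <-> hidden units
Z = rng.normal(size=(24, 40))    # answer units  <-> hidden units (bidirectional)
bias = rng.normal(size=40)       # one bias weight per hidden unit

problem = rng.choice([-1.0, 1.0], size=26)   # clamped problem representation
hidden = np.zeros(40)
answer = np.zeros(24)

def update_hidden_unit(j, temperature=1.0):
    """Asynchronous update of hidden unit j (e.g., j = 4 in Figure 9).
    Its net input combines 1) the clamped problem units via W, 2) the
    current answer units via the bidirectional Z connections, and 3) the
    bias. tanh is a common mean-field activation; the temperature would
    follow the annealing schedule."""
    net = problem @ W[:, j] + answer @ Z[:, j] + bias[j]
    hidden[j] = np.tanh(net / temperature)

# Units are visited one at a time in a random order, so the value a unit
# settles to depends on which other units were updated before it.
for j in rng.permutation(40):
    update_hidden_unit(j)
print(hidden[4])
```

Because the visiting order is random, rerunning the loop can leave each unit, and hence the final answer pattern, in a different state, which is what allows mathnet to produce occasional errors.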
Learning phase. As in the models previously described, a learning algorithm is used to create the weight values assigned to each connection. Mathematically, the retrieval process just described, and the learning process, are both based on the mean field theory technique. More information on this technique, and its relationship to other network approaches, can be found in Peterson and Anderson (1987), Peterson and Hartman (1989), and Haykin (1994). Appendix B gives more detail on learning and retrieval.

Application to Multiplication Number Fact Retrieval
To ensure robustness of the mathnet model, McCloskey and Lindemann ran three separate simulations using the same architecture. Different initial random connection weights were assigned, as well as a different order of presenting problems on each learning cycle, and a different order of unit selection for updating on each iteration of the annealing schedule. The authors report a unique training regimen, which simulates human experience and also appears to eliminate the problem of catastrophic interference.
Training regimen. During the learning phase, mathnet was presented with 64 different multiplication number facts (i.e., 2 × 2 through 9 × 9). Two factors often posited to be the cause of the problem size effect in human subjects, order of learning and frequency of presentation, were simulated. The training of the network began with an ordered presentation of stimuli. Eight sets of stimuli were presented for five learning cycles each. The order of problem presentation within a set was varied randomly in each cycle. The first training set contained the problems with an operand of 2, which are called the 2's problems. The second training set included the 2's problems and also the 3's problems, excepting those 3's facts that were previously contained in the 2's set (e.g., 2 × 3). The third set included the 2's, 3's, and 4's problems, and so on. The new problems in each set were included twice each, and the old problems just once.
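The construction of these eight ordered sets can be sketched as follows (the five-cycle presentation loop and the subsequent frequency manipulation are omitted; the set-building rule follows the description above):

```python
def ordered_training_sets():
    """Build the eight cumulative training sets described in the text:
    set k contains all previously seen problems once each plus the newly
    introduced k's problems twice each. Only 2 x 2 through 9 x 9 are used."""
    sets, old = [], []
    for k in range(2, 10):
        # problems introduced by the k's table: those whose larger operand is k
        new = [(a, b) for a in range(2, 10) for b in range(2, 10)
               if max(a, b) == k]
        sets.append(old + new * 2)     # old problems once, new problems twice
        old = old + new
    return sets

sets = ordered_training_sets()
print(len(sets))       # eight ordered sets
print(len(sets[-1]))   # final set: 49 old problems once + 15 new problems twice = 79
```

Each successive set thus rehearses all earlier facts while emphasizing the new ones, which is one reason the regimen avoids the catastrophic interference seen under strictly sequential training.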
Following the ordered training, all 64 problems were presented to the network in one set to simulate frequency effects. In this final training set, the inclusion frequency of a particular problem was based on a size class, with problems classified as "small" included more often than those classified as "large." In the learning phase, performance of the networks was tested every five learning cycles until the answers produced in the free phase were totally correct, requiring 123 learning cycles on average across all networks. Then, testing was done by presenting the 64 problems, running the 16-iteration annealing schedule until equilibrium was reached, and scoring the answers. This test was done ten times, and the results for all networks were almost perfect (only one problem out of 640 was missed). Thus, mathnet demonstrated the ability to learn the times tables in less training time than would be required by back-propagation.
Speeded testing effects. To simulate the pressure of speeded testing conditions in human subjects, McCloskey and Lindemann (1992) shortened the annealing schedule used in testing, omitting the first five iterations and discontinuing iterations as soon as a stable state was achieved. Each of the three networks was tested 30 times on the set of 64 problems. This speeded testing resulted in occasional errors, reducing the mean accuracy to 97.3%. Examination of the types of errors made shows that 79% were operand errors, exactly matching the proportion found in human subjects by Campbell and Graham (1985); 5% were table errors; and 16% were errors classified as non-table. Moreover, in 91% of the operand errors, the non-shared operand was within one unit of the correct operand, thus simulating the operand distance effect.
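The error taxonomy used here can be made concrete with a small sketch. The classification rules below follow the standard convention for these error categories; edge cases may be scored differently in the original study:

```python
# All products in the 2-9 times tables (the problems mathnet was trained on).
TABLE = {a * b for a in range(2, 10) for b in range(2, 10)}

def classify_error(a, b, response):
    """Classify an incorrect response to a x b: an operand error is the
    correct answer to a problem sharing an operand, a table error is some
    other product in the tables, and anything else is a non-table error."""
    assert response != a * b, "response is correct, not an error"
    operand_answers = ({a * k for k in range(2, 10)} |
                       {k * b for k in range(2, 10)})
    if response in operand_answers:
        return "operand"
    if response in TABLE:
        return "table"
    return "non-table"

print(classify_error(4, 8, 24))   # 4 x 6 = 24 shares the operand 4
print(classify_error(4, 8, 25))   # 5 x 5 = 25 is in the tables but shares no operand
print(classify_error(4, 8, 33))   # 33 is not a single-digit product
```

The operand distance effect mentioned above concerns the first category: in an operand error such as 4 × 8 → 24 (= 4 × 6), the non-shared operand (6) tends to lie within one unit of the correct operand (8) far less often than within two, and so on.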
Problem size effects. To determine whether the network showed a problem size effect, response time was correlated with problem size and with error rate. The correlation between network response time (the number of iterations to reach stability) and problem size, measured as the sum of the problem operands, was .69 (p < .001). This compares well with studies of human subjects, which have found correlations ranging from .6 to .8 (Campbell & Graham, 1985; Miller et al., 1984; Stazyk et al., 1982). The correlation between network error rate and the sum of the problem operands was .52 (p < .001), as compared to Campbell and Graham's (1985) correlation of .63 for human subjects.
To investigate the cause of the problem size effect found in mathnet, McCloskey and Lindemann (1992) varied the training regimen for additional networks. No problem size effect arose when networks were trained without the order and frequency manipulations. Likewise, various ordered training regimens without a subsequent frequency manipulation did not yield the problem size effect. In fact, the problem size effect appeared even when frequency alone was manipulated in the same manner as in the original regimen. Therefore, the authors conclude that frequency is the main determinant of the mathnet problem size effect, as was suggested by Fendrich et al. (1993) for human subjects.
Lesioning effects. McCloskey and Lindemann (1992) simulated brain-damaged human subjects, "lesioning" mathnet by reducing each connection weight by a random percentage of its magnitude. Their results showed definite impairment, with the accuracy of the three networks declining to 69%, 86%, and 81%. Also, similar to brain-damaged human performance, the impairment was non-uniform: correct answers were given to some problems on every test, mistakes were made on other problems in only some of the tests, and still other problems were consistently missed. The network also showed a problem size effect on errors, but this was considerably weaker than for human subjects. As found in human studies, the percentages of types of error varied across networks. One of the networks yielded a large percentage (34%) of non-table errors, as is occasionally observed in patients (e.g., McCloskey, Harley, et al., 1991 report results of this type for two patients).
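The lesioning procedure itself can be sketched in a few lines. The bound on the random reduction is an illustrative assumption; the text specifies only that each weight is reduced by a random percentage of its magnitude:

```python
import numpy as np

rng = np.random.default_rng(7)

def lesion(weights, max_reduction=0.5):
    """Lesion a weight matrix as described in the text: reduce each
    connection weight by a random percentage of its magnitude. The
    max_reduction bound on the damage is an illustrative choice."""
    fraction = rng.uniform(0.0, max_reduction, size=weights.shape)
    return weights * (1.0 - fraction)

W = rng.normal(size=(26, 40))      # stand-in for one of the connection matrices
W_lesioned = lesion(W)
# Every weight shrinks toward zero but keeps its sign.
print(np.all(np.abs(W_lesioned) <= np.abs(W)))
print(np.all(np.sign(W_lesioned) == np.sign(W)))
```

Because every connection is damaged by a different random amount, the impairment is graded and non-uniform across problems, consistent with the mixed pattern of consistently correct, intermittently missed, and consistently missed problems reported above.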
Several differences between the performance of the lesioned mathnet and impaired humans, in addition to the weak problem size effect, are noted by McCloskey and Lindemann (1992). One of these discrepancies is the performance on complementary problems (e.g., 8 × 6 and 6 × 8). Unlike most patients, the network does not show similar error rates on the two related problems, only a weak tendency in this direction. In fact, each of the networks showed a very high error rate on one particular problem, and a very low error rate on its complement. Lastly, the network made only errors of commission, always arriving at some answer. Brain-damaged subjects also make errors of omission, by failing to give any response at all (McCloskey, Harley, et al., 1991).
Additional mathnet lesioning simulations indicate that the location of lesioning, in addition to the amount of damage, must be considered when evaluating the simulation results (Lories, Aubrun, & Seron, 1994). In addition to confirming the McCloskey and Lindemann (1992) results, Lories et al. obtained interactions between location and error type, and between amount of damage and error type.
In brief, using the terms of Figure 9: increasing damage to W results in a shift from operand errors to in-table errors, increasing damage to Z induces out-of-table errors, and increasing damage to A induces operand errors.

Discussion
The mathnet model, embedded in a general framework that is based on dissociations demonstrated by brain-damaged patients, appears to be the strongest model of those reviewed in this paper for producing quantitative results comparable to human data. It performs strongly in its ability to learn multiplication number facts, as evidenced by the almost perfect results attained when the full annealing schedule was used for retrieval. Also, the speeded testing technique yields occasional errors with a pattern resembling that of human subjects. Lesioning tests generate network results somewhat similar to those of brain-damaged human subjects. However, there are certain shortcomings in this network as a model of human performance on multiplication number facts.
The weaknesses of mathnet include a relatively small task scope, and a failure to predict some of the phenomena observed in human subjects. The current focus of mathnet is quite narrow, including only the production of answers, and excluding those problems containing 1's and 0's as operands. McCloskey and Lindemann (1992) state that they plan to enlarge this scope. The mathnet model shows an error percentage identical to that of human subjects on operand errors. However, it fails to match human performance for other types of errors, exhibiting proportions of table and non-table errors that are the reverse of those of human subjects. The network also fails to demonstrate the typical human-subject advantages with the five times table or with tie problems. Perhaps most interesting is the failure of the network to extract and employ the principle of commutativity. It appears that brain-damaged human subjects may be able to exploit this complementary relationship to compensate for retrieval failure on a specific problem. The network does not show this capability.

Conclusion
In summary, the connectionist approach can predict the problem size effect and the typical human error pattern for multiplication number fact retrieval, as demonstrated most robustly by mathnet. Investigations of possible causes of the problem size effect have shown frequency and number representation to be major factors in modeling. The probabilistic approach of mathnet even produces occasional errors, and simulates the performance of brain-damaged patients. Also, the connectionist approach does not preclude the prediction of human behavior that appears to be rule driven. The bsb and back-propagation simulations, which included zero-operand problems, were able to extract the "rule" of zero multiplication. However, although some ability to generalize learning is demonstrated, none of the models reviewed extracts the principle of commutativity as well as even brain-damaged human subjects. Simulations of order effects have not only failed to produce the problem size effect, they have, except in mathnet, produced "catastrophic interference." Therefore, despite the successes of these models, some challenges still remain for the future. The ability to form a large set of specific related associations and also generalize that learning does not as yet equal that of human subjects. Alternative solutions to the sequential learning problem of catastrophic interference call for further investigation. Also, the biological plausibility and psychological relevance of these architectures and algorithms remain somewhat in question.
Additional key issues, in regard to human cognition, are the role and the characteristics of number representation within the model architecture. These models illustrate a relationship between the components of a connectionist model: architecture, stimulus representation, and algorithm. The architecture can be quite simple, as in the one-layer bsb, or multi-layered, as in the back-propagation and mathnet models. In contrast to the bsb, with the number of representation units sometimes exceeding 1000, the three-layer networks employ fewer than 30 input units. The richness of the bsb representation provides a wide-ranging predictive capacity. In the more complex architectures, power is provided by internal representations derived in the hidden units by more complex learning algorithms. Although a meaningful interpretation of this representation remains to be done, de-correlation of these internal values (e.g., through activation sharpening) may offer relief from the problem of catastrophic interference.
A major issue for these, and other, models is the nature of numerical representation at the time of arithmetic fact retrieval. One of the features of mathnet, and a cornerstone of the general modular model, is the "abstract internal representation" of number. This representation has generated considerable controversy (see Campbell, 1992; Campbell & Clark, 1988; Campbell & Clark, 1992; but see McCloskey, Macaruso, & Whetstone, 1992). In contrast to an abstract representation, Campbell and his colleagues posit an "encoding-complex" memory representation, based on their empirical data. This view proposes that numbers are represented internally, and available for use in calculation, in closely associated multiple modalities. This concept of multiplicity of representation is also found in the Anderson bsb model, although in a simpler fashion. Debate on this issue drives to the heart of a core problem in modeling human behavior, namely the fundamental question of human mental representation of information.
Finally, what is the contribution of these models to the domain of cognitive studies? Although they simulate some of the known human data, they are not able to produce comparable results in all areas, nor have they predicted new phenomena. Therefore, in a strict sense, they cannot be considered as having the predictive power of a theory. However, they have served to help us examine the feasibility, or lack of feasibility, of specific mechanisms to explain existing data. These attempts to simplify and formalize our existing knowledge have helped to pinpoint issues and gaps in current theories, especially those relating to mental representation, thus fueling new empirical research.

Figure 2.
Figure 2. Organization of a state vector for a bsb multiplication problem. The representation of each number consists of two kinds of information, symbolic (S) and analog magnitude (A). The bar moves from left to right as the magnitude of a number increases. Bars for numbers close in magnitude will overlap in their position.

Figure 4.
Figure 4. Operation of the bsb. The activity of the units is multiplied by the weights in the connection matrix W and the resulting feedback is added to the activity level of each connected unit.

Figure 5.
Figure 5. Coarse-coded representation used by McCloskey and Cohen (1989). The top panel illustrates the 12-element representation used for each number from zero to nine inclusive. The bottom panel shows the construction of the input stimuli and the desired output, both of which use these numeric representations.

Figure 9.
Figure 9. A view of the mathnet architecture illustrating the activation update of hidden unit 4. See Appendix B for more detail.