Application of implicit knowledge: deterministic or probabilistic

This paper distinguishes two models specifying the application of implicit knowledge. According to one model, originally suggested by Reber (1967), subjects either apply sufficient knowledge to always produce a correct response or else they guess randomly (High Threshold Theory; subjects only apply knowledge when there is sufficient knowledge to exceed a threshold ensuring a correct response); according to the other model, suggested by Dienes (1992), subjects respond with a certain probability towards each item, where the probability is determined by the match between the item's structure and the induced constraints about the structure (Probability Matching Theory; subjects match their probability of responding against their personal probability that the item belongs to a certain category). One-parameter versions of both models were specified and then tested against the data generated from three artificial grammar learning experiments. Neither theory could account for all features of the data, and extensions of the theories are suggested.
Dienes and Berry (1997) argued that there is widespread agreement about the existence of an important learning mechanism, pervasive in its effects, and producing knowledge by unconscious associative processes, knowledge about which the subject does not directly have metaknowledge. Let us call the mechanism implicit learning, and the induced knowledge implicit knowledge. This paper will address the question of how implicit knowledge is applied.
Imagine an animal learning about which stimuli constitute food. Maybe its mother brings back different plants or animals for it to eat, and stops it from ingesting other plants or animals, which may be dangerous or poisonous. The details about what features go together to constitute something that is edible may be more or less complex; in any case, the animal eventually learns, let us say perfectly, to seek out only the edible substances in its environment. But before learning reaches this point, how should the animal behave towards stimuli about which it has only imperfect information and when the mother is not around to provide corrective feedback? One strategy would be to ingest only stimuli that the animal's induced knowledge unambiguously indicates are edible; this would prevent unfortunate fatalities. On this scenario, implicit knowledge may have evolved in general to direct action towards a stimulus only when the implicit knowledge relevant to the stimulus unambiguously indicates the category to which the stimulus belongs. Partial information may not even be made available for controlling behaviour in case the animal uses the knowledge to sample poisonous substances. Thus, if the animal were forced to respond to some stimuli about which it had imperfect information, the animal would be reduced to random responding. We will call this the High Threshold Theory (HTT); metaphorically, the knowledge lies fallow until it exceeds a high threshold.
Now imagine an animal learning about which locations are likely to contain food. This scene differs from the last in that a wrong decision is not so likely to have catastrophic consequences. If the animal's knowledge merely suggests that a location contains food, it still may be worth sampling in preference to another. There are two different states of affairs to be distinguished in this case. First, different locations may have objectively different probabilities of containing food. The animal may come to have perfect knowledge of the different probabilities. It may seem rational for the animal to always choose the location with the highest probability, but this is not so if the probability structure of the environment is subject to fluctuations. Then it is beneficial to occasionally sample even low probability locations because it allows an assessment of whether the probability remains low.
Also, because other animals will in general be competing for the food, and there will be more competitors at higher probability locations, it pays to forage sometimes at low probability locations. In fact, animals do forage in different locations according to the probability of finding food there (Stephens & Krebs, 1986); that is, they probability match. Similarly, in predicting probabilistic events, people, like other animals, are also known to probability match rather than respond deterministically. Reber and Millward (1968) asked people to passively observe binary events at a rate of two per second for two or three minutes. When people were then asked to predict successive events they probability matched to a high degree of accuracy (to the second decimal place; see Reber, 1989, for review).
Second, different locations may have imperfectly known probabilities of containing food. This is the case in which we are interested: how does the animal apply its imperfect knowledge? The knowledge may be imperfect because the locations have not yet been sampled frequently, or because the location is new and the animal does not know how to classify it. Consider an animal faced with choosing one of two locations. In the absence of certain knowledge about the objective probability structure, one strategy would be to search each location according to the estimated probability. For example, in the absence of any information, the animal would search each location with equal probability, and adjust the search probabilities as information was collected. Or if features of the locations suggested different probabilities, search could start at those estimated probabilities. The true probabilities could be homed in on more quickly if the animal starts near their actual values. We will call the hypothesis that the animal responds to stimuli with a probability that matches the stimuli's expected probabilities Probability Matching Theory (PMT). If we allow probabilities to apply to singular events, then animals may respond probabilistically to stimuli for which there are no objective probabilities other than 0 or 1. For example, in classifying an edible stimulus as either being palatable or unpalatable, the animal may ingest it with a probability determined by the likelihood that the stimulus is palatable. In this case, the probability of the animal ingesting a stimulus may be interpreted as the animal's 'personal probability' that the stimulus is edible.
We have outlined two hypotheses about the way in which implicit knowledge could be applied: HTT and PMT. It may be that implicit knowledge is applied in different ways according to context; for example, according to the cost of a wrong choice, as in the examples of ingesting stimuli and searching locations given above. Or it may be that evolution has for economy produced a mechanism with a single principle of application. We will investigate these two hypotheses, HTT and PMT, in one context, that of people implicitly learning artificial grammars. In that field, both hypotheses have been suggested as accounting for human performance.
In a typical artificial grammar learning experiment, subjects first memorize grammatical strings of letters generated by a finite-state grammar. Then, they are informed of the existence of the complex set of rules that constrains letter order (but not what the rules are), and are asked to classify grammatical and nongrammatical strings. In an initial study, Reber (1967) found that the more strings subjects had attempted to memorize, the easier it was to memorize novel grammatical strings, indicating that they had learned to utilize the structure of the grammar. Subjects could also classify novel strings significantly above chance (69%, where chance is 50%). Reber (e.g., 1989) argued that the knowledge was implicit because subjects could not adequately describe how they classified strings (see Dienes & Perner, 1996, and Dienes & Berry, 1997, for further arguments that the knowledge should be seen as implicit).
Reber (1967, 1989) argued for a theory of implicit learning which combined HTT with the Gibsonian notion of veridical information pick-up. Specifically, Reber argued that implicit learning results in underlying representations that mirror the objective structures in the environment. In the case of artificial grammar learning, he argued that the status of a given test item would either be known (and would thus always be classified correctly); or the status was not known, and the subject guessed randomly. On this assumption, any knowledge the subject applied was always perfectly veridical; incorrect responses could only be based on failing to apply knowledge rather than on applying incorrect knowledge. Reber used these assumptions by testing each subject twice on each string without feedback. If the probability of a given subject using perfect knowledge of the ith string is k_i, then the expected proportion of strings classified correctly twice, once, or not at all by that subject is determined by the following equations:
the proportion of strings classified twice correctly = p(CC) = k + (1 - k) * 0.5^2
the proportion of strings classified correctly just once = p(CE) + p(EC) = (1 - k) * 0.5
the proportion of strings never classified correctly = p(EE) = (1 - k) * 0.5^2
where k is the average of the k_i.
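As a rough illustration of the two-test equations, the sketch below computes the HTT proportions for a single value of k. The function name `htt_two_tests` and the value k = 0.38 are assumptions chosen for illustration (0.38 is the value that k + (1 - k) * 0.5 = 0.69 would imply); neither comes from Reber's analyses.

```python
def htt_two_tests(k):
    """Expected response-pattern proportions under HTT for two tests per string.

    k is the average probability that a string is known; unknown strings
    are guessed with probability 0.5 on each test.
    """
    guess = 1.0 - k
    return {
        "CC": k + guess * 0.5 ** 2,   # known, or two lucky guesses
        "CE+EC": guess * 0.5,         # one lucky guess, in either order
        "EE": guess * 0.5 ** 2,       # two unlucky guesses
    }


# Illustrative value only: k = 0.38 corresponds to 69% correct overall.
proportions = htt_two_tests(0.38)
```

Whatever the value of k, the three proportions sum to 1, and p(CE) and p(EC) taken separately each equal p(EE).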
Under this model, the values of p(CE), p(EC) and p(EE) averaged across subjects should be statistically identical and lower than p(CC). If p(EE) is greater than p(CE) or p(EC), then this is evidence that subjects have induced rules that are not accurate reflections of the grammar; the incorrect rules lead to consistent misclassifications. Reber (1989) reviewed 8 studies in which subjects in the learning phase were exposed to grammatical stimuli presented in a random order so that the grammatical rules would not be salient, and in which subjects were tested twice in the test phase. When subjects were asked to search for rules in the learning phase, p(EE) was on average .22 and the average of p(CE) and p(EC) was .13. That is, when subjects were asked to learn explicitly they induced substantial amounts of nonrepresentative rules. When subjects were asked to memorize strings in the learning phase, p(EE) was on average .16 and the average of p(CE) and p(EC) was .12. That is, when subjects were encouraged to learn implicitly, they only applied knowledge when it was largely (even if not entirely) representative, consistent with Reber's claim that knowledge was only applied when it unambiguously indicated a correct classification.
In contrast to Reber, Dienes (1992; Dienes & Perner, 1996) suggested a connectionist model of artificial grammar learning that applied its knowledge according to PMT. The model became sensitive to the actual statistical structure of the environment, and to this degree embodied Reber's notion of representational realism. However, the network was a two-layer connectionist network and so could only become sensitive to first-order statistical structure, and not higher orders; in this way, its knowledge did not entirely mirror that of the environment. During the learning phase the network was trained to predict each letter from every other letter in the learning string. In the test phase, it attempted the same prediction. The activation of an output unit was determined by the correlation between the actual and predicted letters in the test string. The output activation directly determined the probability that the network would give a 'grammatical' response. Thus, the more each letter in the test string could be accurately predicted from the other letters, the greater was the probability of the network saying 'grammatical'.
Dienes (1992) showed that for a set of stimuli that had been used in the past in human experiments by both Dulany, Carlson, and Dewey (1984) and Dienes, Broadbent, and Berry (1991), the proportion of subjects classifying each test string correctly was a reliable feature of the data across different experiments. The proportions for different strings varied between 37% and 82% with a good spread in between. Reber's model would interpret this variation in proportions as variation in the proportion of subjects who knew each string (with the values below 50% being due to the occasional nonrepresentative knowledge induced by subjects). Dienes interpreted the variation not as variation across subjects but as a variation across items in the probability of saying 'grammatical' (it was assumed that this probability was constant across subjects). In other words, subjects could be 'probability matching' (in quotes because there are no objective probabilities to match: each test string either is or is not grammatical). The model incorporated the assumption of 'probability matching' in that the activation of the output unit codes the degree that the test string satisfies the internal constraints induced by the grammar, and hence a sort of personal probability that the string is grammatical. Dienes (1992) showed that the connectionist network could successfully predict the probability of each test string being called grammatical (i.e. the proportion of subjects who called it grammatical), the overall classification performance of subjects, and the p(CC), p(CE), and p(EE) values averaged over subjects and strings. That is, Dienes showed that a model incorporating the assumptions of PMT could account for the same data that Reber (1989) had argued supported his HTT. The closeness of p(CE) and p(EE) could only be produced in the PMT model by having some proportion of items with a probability correct, p, well below .50. If p > .50, then p(CE) > p(EE) since p(1-p) > (1-p)^2.
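The point about items with p below .50 can be checked numerically. The sketch below is a minimal illustration under PMT with two tests per string; the function name and the item probabilities are hypothetical and are not taken from Dienes (1992).

```python
def pmt_two_tests(item_probs):
    """Average p(CE) and p(EE) over items under PMT with two tests per string.

    Each item is answered correctly with its own probability p on both tests,
    independently, so p(CE) = p * (1 - p) and p(EE) = (1 - p) ** 2.
    """
    n = len(item_probs)
    mean_ce = sum(p * (1 - p) for p in item_probs) / n
    mean_ee = sum((1 - p) ** 2 for p in item_probs) / n
    return mean_ce, mean_ee


# Hypothetical item sets, for illustration only.
all_above_half = pmt_two_tests([0.6, 0.7, 0.8])
some_below_half = pmt_two_tests([0.8, 0.8, 0.3])
```

With every p above .50 the averaged p(CE) stays clearly above the averaged p(EE); including a single item with p well below .50 pulls the two averages together.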
In summary, the data published so far do not discriminate the two theories of application of implicit knowledge, PMT and HTT. The aim of this paper is to directly test the two theories by testing each subject three times on a set of test strings and looking at the distribution of correct responses for each string separately.
Both models contain just one parameter which can be estimated from the overall proportion, P, of correct classifications for each item. For the Reber model, the parameter X (the knowledge parameter, written k in the equations above) can be estimated from the relationship X + (1 - X) * 0.5 = P (thus, X = 2 * P - 1), and for the PMT model p is estimated directly by P.
In some cases, the models make qualitatively similar predictions. For example, the Reber model predicts that p(3Cs), the probability of classifying an item correctly on all three tests, should be greater than p(2Cs, 1E), the probability of two correct classifications and one error, given that P > .70. Similarly, the PMT model predicts that p(3Cs) should be greater than p(2Cs, 1E) given that P > .75. Further, both models predict that for Ps close to .50, the distribution of responses should approximate a binomial with p = 0.50.
The models also make qualitatively different predictions. For the Reber model, regardless of P, p(2Cs, 1E) should always be equal to p(1C, 2Es). For the PMT model, these probabilities will differ except in the special case that p = 0.5. If p > 0.5 then p(2Cs, 1E) will be greater than p(1C, 2Es). Also for the Reber model, regardless of P, p(1C, 2Es) should always be equal to 3 * p(3Es). For the PMT model, p(1C, 2Es) will be greater than 3 * p(3Es) whenever p > 0.5.
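The qualitative differences can be made concrete by writing out both three-test distributions for one value of P. The sketch below is a minimal illustration that assumes P >= .50 and independent responses across the three tests; the function names and the value P = 0.8 are hypothetical.

```python
from math import comb


def htt_three_tests(P):
    """HTT distribution over 3, 2, 1, 0 correct responses, assuming P >= 0.5."""
    X = 2 * P - 1                      # proportion known, from X + (1 - X) * 0.5 = P
    guess = 1 - X
    # Guessers follow a binomial with p = 0.5; knowers are always correct.
    return [X + guess / 8, 3 * guess / 8, 3 * guess / 8, guess / 8]


def pmt_three_tests(P):
    """PMT distribution over 3, 2, 1, 0 correct responses (binomial, p = P)."""
    return [comb(3, c) * P ** c * (1 - P) ** (3 - c) for c in (3, 2, 1, 0)]


# Illustrative value only.
htt = htt_three_tests(0.8)
pmt = pmt_three_tests(0.8)
```

Under HTT the two middle entries are equal and the third entry is three times the last; under PMT with p > 0.5 the second entry exceeds the third, and the third exceeds three times the last.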
The models can also be fitted quantitatively to the actual distributions obtained. These qualitative and quantitative predictions are explored in three experiments.

Experiment 1

Method
Participants. A total of 127 first-year psychology majors from the University of Salzburg participated in a class experiment. There were 33 men and 94 women with ages ranging from 18 to 37 years.
Materials. The finite-state grammar and stimuli used were the same as used by Reber and Allen (1978), and later by Dienes (1992), amongst others.
Learning items. The learning items were 3 to 5 letters long, and composed of the letters M, R, X, V, and T. The items were projected overhead so that the height of letters subtended about ½° to 2° of visual angle (depending on the distance in the auditorium). In addition to visually displaying the items, the experimenter spelled them out aloud.
Test items. Reber and Allen's (1978) set of test items was used with the following modifications. Any items that were part of the learning set and any items of less than 4 letters were omitted. Of the remaining items, 6 grammatical and 6 ungrammatical items were selected as "repetition items". The selection criteria were to find in each category 3 items 5 letters long and 3 items 6 letters long for which the classification accuracy in Dienes (1992) was around 67%. The eventually selected items had accuracies that varied between 57% and 78%. The final test list then consisted of 63 items: 27 non-repeated strings (11 grammatical and 16 ungrammatical) and 12 strings that were repeated 3 times each (making another 36 items). These items are shown in Table 1. Each of the repeated sequences was placed once in each successive third of the whole set. This placement was random with the restriction that for each item there were at least 10 intervening items before its repetition.

Procedure.
Learning. The class of 127 students was divided into three groups on the basis of seating arrangements. This resulted in 45 students in the No Exposure Group (no learning experience), 33 in the Single Exposure Group (single exposure to the learning set) and 49 students in the Double Exposure Group (two exposures to the learning set). After a general introduction explaining that students were to participate in a memory experiment with three different groups, members of the No Exposure Group were sent out of the auditorium. The two groups remaining were given a recording sheet and received the following instructions: "I will now show you a list of letter strings which you should try to remember. To help you, I will show only 4 items at a time for a few seconds. Try to memorise these items and then reproduce them on your sheet of paper from memory. Do not start writing before the items have been covered up again! I will then show you the items again. Mark how many of them you had written down correctly without a single letter wrong. Then I'll show you the next four items, and so on." The timing of items was determined by spelling out each item at a pace of 3 to 5 sec per item. The exposure of a group of 4 items lasted, therefore, about 16 seconds on average. After the end of the 20 learning items the members of the Single Exposure Group handed in their recording sheets and then left the auditorium with instructions not to talk about the learned items among themselves or to the members of the No Exposure Group.
The remaining students (Double Exposure Group) were instructed to turn over their recording sheet for a second run through the learning list. After completion the recording sheets were collected, the other two groups were called back to the auditorium, and the lists with test items were distributed.
Test. The list of 63 test items was presented in one column stretching over the two sides of an A4 sheet. Students were given the following explanation and instructions: "The sequences you've just seen were produced by a system of rules, i.e., the grammar of an artificial language. I will now show you further sequences of this kind and you should decide for each item whether it belongs to the same artificial language (that is, has been produced by the same system of rules as the items you have studied) or not." For the No Exposure Group the following addition was made: "For you these instructions may seem nonsensical, since you have not yet seen any such sequences. However, it is often possible to just use one's feeling to classify items one way or the other: Some sequences just feel more natural than others. Try to do your best." All students were then told to put next to each item in column "A" their answer: "y" if they thought the item was grammatical or "n" if they thought it was not. They were also asked to provide an estimate of their confidence in column "C" by putting 50% if they were completely unsure of their judgement, 100% if they were absolutely certain, or any percentage in between depending on the strength of their confidence.
At the very end students were asked (1) whether they had ever participated in a similar artificial grammar learning experiment before, (2) whether they thought they had formed any rules about the strings, and (3) whether they had noticed anything else about the strings (the hope was that this would bring students who had noticed the repetition of items to remark on it).

Results
Over all repeated and nonrepeated items, the No Exposure Group had a classification performance of 54% (SD = 8%), the Single Exposure Group had a classification performance of 62% (SD = 8%), and the Double Exposure Group had a classification performance of 66% (SD = 8%). The groups differed significantly, F(2, 124) = 24.46, p < .005. That is, the amount of exposure in the learning phase influenced subsequent classification performance.
Appendix 1 shows the data for each repeated test string individually, displaying the number of subjects getting each string correct three times, twice, once, or not at all, and the fits of the PMT and HTT models. Some strings had Ps below .50, sometimes substantially so. This poses a problem for the HTT model outlined in the introduction because it only allowed subjects to know a string correctly or else guess. Thus, it was assumed that in cases where P was below .50, there was a proportion X of subjects who 'knew' a string incorrectly; i.e. a proportion X of subjects always gave the wrong response and the remaining (1 - X) of subjects guessed randomly. For all other Ps, the assumptions given in the introduction were applied.
Qualitative comparisons. Qualitatively consistent with both theories, all items from the learning groups with a P of .70 or greater had higher p(3Cs) than p(2Cs, 1E).
Quantitative comparisons. The fit between the models and the data was determined for each item in each condition by computing a chi-square based on the observed and predicted numbers of subjects classifying the item correctly exactly three times, twice, once, or not at all (see Appendix 1). Summing over items, the total chi-square values for the learning groups for PMT and HTT are 300.91 (df = 72) and 135.96 (df = 72), respectively, p << .005 in both cases. That is, the data departed significantly from both models, with a better fit for HTT than for PMT. By a sign test over items, the fit was significantly better for HTT than for PMT, p = .0005.
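The per-item fit statistic can be sketched as follows. This is a minimal illustration that assumes the ordinary Pearson form of chi-square, which the text does not state explicitly; the function name and the observed counts are hypothetical and are not taken from Appendix 1.

```python
from math import comb


def pearson_chi_square(observed, predicted_props):
    """Pearson chi-square between observed counts and model proportions."""
    n = sum(observed)
    total = 0.0
    for obs, prop in zip(observed, predicted_props):
        expected = n * prop
        if expected == 0:
            if obs > 0:
                return float("inf")   # model rules out an observed outcome
            continue
        total += (obs - expected) ** 2 / expected
    return total


# Hypothetical counts of subjects with 3, 2, 1, 0 correct responses to one item.
observed = [20, 12, 6, 2]
n_subjects = sum(observed)
P = (3 * observed[0] + 2 * observed[1] + observed[2]) / (3 * n_subjects)

# PMT prediction: binomial with p estimated directly by P.
pmt_props = [comb(3, c) * P ** c * (1 - P) ** (3 - c) for c in (3, 2, 1, 0)]
chi_square_pmt = pearson_chi_square(observed, pmt_props)
```

The same function applies to the HTT proportions; summing the per-item values over items gives totals of the kind reported above.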
In the No Exposure Group, the total chi-squares for PMT and HTT were 494.69 (df = 36) and 273.72 (df = 36), respectively; in the Single Exposure Group, the total chi-squares for PMT and HTT were 188.55 (df = 36) and 62.31 (df = 36), respectively; and for the Double Exposure Group they were 112.36 (df = 36) and 73.65 (df = 36), respectively. Notice that for PMT, chi-squares approximately halved with each successive exposure; that is, the fit of the model became markedly better as learning progressed. For PMT, the fit for the Single Exposure Group was significantly better than for the No Exposure Group, p = .006 by sign test over items; but the fit for the Double Exposure Group was not significantly better than for the Single Exposure Group, so it is not clear that this increase in fit would apply to other items from the grammar. For HTT, the fit improved with one exposure of the learning items rather than no learning (p = .006 by sign test), but levelled off after one exposure.
What is the reason for the HTT model's superiority? In 23 of the 24 cases in the learning conditions, there were more consistent responders than PMT predicted; i.e. both p(3Cs) and p(3Es) were higher than predicted. HTT fared much better, underpredicting p(3Cs) in only 13 out of 24 cases, and underpredicting p(3Es) in only 18 out of 24 cases.
In the No Exposure Group, both PMT and HTT underpredicted both p(3Cs) and p(3Es) in all 12 cases. That is, the distribution in this group consisted mostly of consistent responders, which is inconsistent with both models.

Experiment 2
The main findings of Experiment 1 are that PMT predicted better than HTT the greater proportion of subjects getting just two rather than just one trial correct, but that HTT provided the better overall quantitative fit by virtue of fitting p(3Cs) more closely than PMT. Both theories tended to underpredict p(3Es). The aim of Experiment 2 was to replicate these findings when subjects were tested individually rather than as a group.

Method
Participants. Thirty students (three men and 27 women) from the University of Salzburg volunteered for this study. They were aged between nineteen and fifty-seven (M = 27.47, SD = 8.86).
Stimuli. The artificial grammar and the training and test strings were the same ones as used in Experiment 1 with some modifications. The letters used were M, R, T, X, and Z. The original letter V was replaced by Z because V was hardly distinguishable from U on the display. The set of twenty training strings remained otherwise unchanged. Test strings consisting of three letters were not used, so that there remained twenty-one grammatical and twenty-two nongrammatical strings. Therefore one grammatical string (MTZR) was added. The four grammatical strings in the original test set that appeared in the training set were replaced (MZRXZT was replaced by MZRXZR, ZXZT by ZXZR, ZXZRX by ZXTZR, and ZXZRXZ by ZXZRXT).
A Power Basic program controlled the display of instructions, stimuli, and collection of responses. For each participant the training and the test strings were randomized anew by the computer so that no participant had the same sequence of training or test strings. The repeated test strings were positioned so that one instance appeared in each successive third of the test sequence. Within each third, the item was positioned randomly with the restriction that at least two intervening items must separate repetitions of the same item.
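The positioning constraint can be sketched with simple rejection sampling. This is not the authors' Power Basic program; it is a minimal Python illustration in which the function names, the `max_tries` limit, and the way non-repeated strings are dealt into three nearly equal blocks (so the "thirds" are only approximate when the list length is not divisible by three) are all assumptions.

```python
import random


def gaps_ok(sequence, min_gap):
    """True if any two occurrences of an item have >= min_gap items between them."""
    last_seen = {}
    for index, item in enumerate(sequence):
        if item in last_seen and index - last_seen[item] - 1 < min_gap:
            return False
        last_seen[item] = index
    return True


def build_test_sequence(repeated, non_repeated, min_gap=2, rng=None, max_tries=10_000):
    """Shuffle test strings so each repeated string occurs once per third."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        pool = list(non_repeated)
        rng.shuffle(pool)
        # Deal the non-repeated strings into three nearly equal blocks.
        blocks = [pool[i::3] for i in range(3)]
        sequence = []
        for block in blocks:
            block = block + list(repeated)  # one instance per block
            rng.shuffle(block)
            sequence.extend(block)
        if gaps_ok(sequence, min_gap):
            return sequence
    raise RuntimeError("no sequence satisfying the gap constraint was found")


# Hypothetical string labels, for illustration only.
example = build_test_sequence(
    ["R1", "R2"], [f"N{i}" for i in range(8)], min_gap=2, rng=random.Random(0)
)
```

Rejection sampling keeps the sketch short: a candidate order is discarded whenever a repeated string at the end of one block falls too close to its next instance at the start of the following block.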
Procedure. All participants were individually tested. Participants were initially recruited to take part in a 15-min learning experiment. They were taken to an office and placed in front of a computer. Then all instructions were given on the screen. After having welcomed the participant and having asked for demographic data (sex, age) the computer displayed: "This experiment is subdivided into two sections. In the first section you are to learn a number of items. Each item is made of several letters, e.g. 'ABCDE'. The items appear separately one after the other in the middle of the screen. Each item stays visible only for a short time before the next item follows. During this time you are to learn the item by spelling it out loudly. If you have any questions about this section please ask the experimenter now, if not then concentrate your attention on the middle of the screen and push the space bar to begin!" After the participant had pushed the space bar the training strings were displayed as described. At this time the participants neither knew that the strings were produced by an artificial grammar nor what they had to do in the second part of the experiment. Each string appeared for three seconds on the screen in white letters on a black background. Then it disappeared and after a break of one second the next string appeared and so on. Having completed the training phase the participants were told: "You have now learned a number of items. Each of these letter strings is based on the same 'grammar', i.e., the order of the letters within each item was not determined by chance but by some underlying structure. The computer demonstrated graphically the described procedure. Then the participants were given the opportunity to practise handling the computer mouse by dealing with example strings consisting of random arrangements of the letters A, E, I, O, and U. They could decide themselves whether they wanted to practise or not. Practising could be broken off after each example. As soon as the participant decided that they had practised enough the computer displayed: "If you have any questions about this section please ask the experimenter now, if not take the 'mouse' into your right hand and push the space bar with your left hand to begin! (Dealing with all items took about 10 min.)" After the participant had pushed the space bar the first test string was displayed. The test string itself was placed in the middle of the screen (as the training strings before) in white letters on a black background. The answer choice "grammatical" was written above "not grammatical". The "mouse pointer" was placed between these two words. After the participant's response the two answer choices (but not the test string) vanished and a scale (in white colour) appeared under the test string. This scale consisted of six equally spaced frames around the percentages 50, 60, 70, 80, 90, and 100. These frames were connected with straight lines. The "mouse pointer" was placed below the middle of the scale. The "mouse-click" could be set within each frame or on the lines between frames. After the response the screen was cleared (black screen) and after one second the next test string appeared. The participants did not get feedback about their classifications. They had as much time as they wanted for their answers.
After the sixty-eight test strings were completed the participants were asked: "Please answer the following question: In your opinion: How many of the items did you judge correctly? (Hint: If you only guessed then it should be about half of all items.) Of the 68 items you judged correctly: ____? (any number between 34-68)" The participants could choose a number only between thirty-four and sixty-eight. Other values were not accepted by the computer. After this last input the computer thanked the participant and said goodbye.
Appendix 2 shows the data for each repeated test string individually, displaying the number of subjects getting each string correct three times, twice, once, or not at all, and the fits of the PMT and HTT models.
Qualitative comparisons. The overall pattern was the same as in Experiment 1. Qualitatively consistent with both theories, all items from the learning groups with a P of .70 or greater had higher p(3Cs) than p(2Cs, 1E).
Quantitative comparisons. The total chi-square values for PMT and HTT were 138.18 (df=36) and 69.29 (df=36), respectively, p < .005 in both cases. That is, the data departed significantly from both models, but the fit was closer for HTT than for PMT (p = .039 by sign test).
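For readers unfamiliar with the sign test over items, the following sketch computes an exact two-sided p value from the number of items on which one model fits better than the other. The function name is hypothetical, and the count used in the usage note is an illustrative input only; the text does not report how many items favoured HTT.

```python
from math import comb


def sign_test_p(n_favoring, n_items):
    """Exact two-sided sign test p value (ties assumed to be excluded beforehand)."""
    k = max(n_favoring, n_items - n_favoring)
    # Probability of a split at least this extreme under a fair coin.
    tail = sum(comb(n_items, i) for i in range(k, n_items + 1)) / 2 ** n_items
    return min(1.0, 2 * tail)
```

As an illustration, `sign_test_p(10, 12)` evaluates to about .039; the 10-of-12 split is an assumed input chosen to show the calculation, not a reported result.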
In terms of consistency of responding, in 11 out of 12 items, PMT underpredicted both p(3Cs) and p(3Es). HTT fared better with p(3Cs), underpredicting p(3Cs) in exactly 6 out of 12 cases, but underpredicting p(3Es) in 10 out of 12 cases.

Experiment 3
Experiment 1 had found model fits to improve as amount of exposure in the learning phase increased. Experiment 3 attempted to replicate Experiment 1 with more exposure in the training phase.

Method
The same method as Experiment 1 was followed with the following exceptions. Different groups of subjects were exposed to the training list once, twice, or three times (the Single Exposure, Double Exposure, and Triple Exposure Groups, respectively). Moreover, to intensify the learning experience, four new learning items (namely, VXTTTV, MTTTV, VXVRXR, MVRXM) were added at the end of the second repetition, and another four (VXRRRM, VXTV, MTVRXR, MVRXRM) at the end of the third repetition.

Results
Overall, the Single Exposure Group had a classification performance of 61% (SD=7%), the Double Exposure Group had a classification performance of 66% (SD=7%), and the Triple Exposure Group had a classification performance of 69% (SD=8%). The difference between the groups was significant, F(2, 134) = 12.08, p < .01. That is, the amount of learning exposure influenced subsequent classification performance.
Appendix 3 shows the data for each repeated test string individually, displaying the number of subjects getting each string correct three times, twice, once, or not at all, and the fits of the PMT and HTT models.
Qualitative comparisons. Qualitatively consistent with both theories, all 17 items with a P of .70 or greater had higher p(3Cs) than p(2Cs, 1E).
Quantitative comparisons. The total chi-square values for PMT and HTT were 402.48 (df=108) and 219.38 (df=108), respectively, p < .005 in both cases. That is, the data departed significantly from both models, with a better fit to HTT than to PMT (p = .039 by sign test over items).
In the Single Exposure Group, the total chi-squares for PMT and HTT were 185.6 (df=36) and 99.09 (df=36), respectively; for the Double Exposure Group they were 135.47 (df=36) and 67.98 (df=36), respectively; and for the Triple Exposure Group they were 81.43 (df=36) and 52.31 (df=36), respectively. For PMT, the increase in fit from one to two exposures was not significant over items by sign test; nor was the increase from two to three exposures. For HTT, the changes were not significant either.
In quantitative comparisons, HTT did better than PMT because of its ability to predict consistent responders. In 30 of the 36 cases, PMT underpredicted both p(3Cs) and p(3Es). HTT, on the other hand, underpredicted p(3Cs) in only 16 out of 36 cases, and underpredicted p(3Es) in 23 out of 36 cases.

Discussion
Experiment 3 replicated the finding that PMT rather than HTT better predicted a greater proportion of subjects getting just two rather than just one trial correct, but that HTT predicted p(3Cs) more closely than PMT. Experiment 3 further showed that the fit of both models improved with increasing learning (although not consistently for all items).

General Discussion
This paper has distinguished two models specifying the application of implicit knowledge. According to one model, originally suggested by Reber (1967), subjects either apply sufficient knowledge to always produce a correct response or else they guess randomly (High Threshold Theory; subjects only apply knowledge when there is sufficient knowledge to exceed a threshold ensuring a correct response); according to the other model, suggested by Dienes (1992), subjects respond with a certain probability towards each item, where the probability is determined by the match between the item's structure and the induced constraints of the grammar (Probability Matching Theory; subjects match their probability of responding against their personal probability that the item is grammatical). One-parameter versions of both models were specified and then tested against the data generated from three artificial grammar learning experiments. The experiments tested subjects three times on a set of test items; each model made distinct predictions about the distributions of subjects classifying correctly exactly three times, p(3Cs); two times, p(2Cs, 1E); one time, p(1C, 2Es); or not at all, p(3Es).
PMT, but not HTT, successfully predicted that p(2Cs, 1E) should be greater than p(1C, 2Es) given that p > .50. However, PMT could not predict the amount of consistency in the data; unlike HTT, it systematically underpredicted p(3Cs). Both theories tended to underpredict p(3Es).
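The contrast between the two one-parameter models can be made concrete with a short sketch. It assumes, in line with the description above, that PMT treats the three tests of an item as independent trials with a common probability p of a correct response, and that HTT has a proportion k of subjects who always respond correctly while the rest guess with probability .5, so that p = k + (1 - k)/2. The function names and the restriction to p of at least .5 are assumptions of this illustration.

```python
def pmt_predictions(p):
    """PMT: three independent trials, each correct with probability p."""
    q = 1 - p
    return {"3C": p ** 3, "2C1E": 3 * p * p * q, "1C2E": 3 * p * q * q, "3E": q ** 3}


def htt_predictions(p):
    """HTT: a proportion k always respond correctly; everyone else guesses at .5."""
    k = 2 * p - 1  # solves p = k + (1 - k) / 2; assumes p >= .5
    g = 1 - k      # proportion of guessers
    return {"3C": k + g / 8, "2C1E": 3 * g / 8, "1C2E": 3 * g / 8, "3E": g / 8}


# Example with an overall proportion correct of .80 (an arbitrary value):
#   PMT gives roughly .512, .384, .096, .008 for 3C, 2C1E, 1C2E, 3E
#   HTT gives roughly .650, .150, .150, .050
pmt = pmt_predictions(0.8)
htt = htt_predictions(0.8)
```

Under these assumptions PMT yields p(2Cs, 1E) greater than p(1C, 2Es) whenever p exceeds .50, whereas HTT always yields equal values for the two, and HTT predicts more consistent responding, p(3Cs) and p(3Es), than PMT for the same p.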
What could be the reasons for both theories failing to predict how consistent subjects are in responding? First we will consider a reason in terms of why the experimental data may not have been an appropriate test of the models. The predictions derived from the models assume that the successive tests of each subject are independent trials. If subjects use implicit or explicit memory of previous responses to the same item, this may spuriously increase consistency. One way of testing whether subjects use memory is based on the assumption that the immediately preceding presentation of the same item should have more memory influence than the one before that, i.e., if subjects are using memory then p(E3/E2C1) should be greater than p(E3/C2E1), where Ej refers to making an error on the jth trial. These probabilities are expected to be identical on both PMT and HTT accounts. Using the data from Experiment 3, p(E3/E2C1) = .57 (based on 159 observations) and p(E3/C2E1) = .39 (124 observations); the difference is significant, χ² = 8.92, df = 1, p < .005. That is, there is evidence that memory did influence subjects' responses. Future research could try to test subjects with repeated items separated by more nonrepeated items than used in the current experiments, to try to minimize subjects' memory of previous responses.
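The comparison of the two conditional error rates is a test on a 2 x 2 table, which can be sketched as follows. The cell counts are reconstructed here by rounding the reported proportions against the reported numbers of observations, so they are approximations; the original counts and any correction applied are not given, and the resulting statistic should not be expected to reproduce the reported value exactly.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (df = 1, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Reported: .57 of 159 third-trial errors after (C1, E2); .39 of 124 after (E1, C2).
n_after_error, n_after_correct = 159, 124
errors_after_error = round(0.57 * n_after_error)      # reconstructed count (assumption)
errors_after_correct = round(0.39 * n_after_correct)  # reconstructed count (assumption)

chi2 = chi_square_2x2(
    errors_after_error, n_after_error - errors_after_error,
    errors_after_correct, n_after_correct - errors_after_correct,
)
```

With these reconstructed counts the statistic is of the same order as the value reported above and well beyond the df = 1 critical value of 3.84 at the .05 level.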
Independently of subjects' memory for responses, there are also natural extensions of HTT and PMT that would produce greater consistency. HTT could be plausibly extended to produce higher p(3Es). In particular, a proportion, X, of subjects might always classify a string correctly, and a proportion, Y, might have induced nonrepresentative knowledge and always classify the same string incorrectly. The remaining (1-X-Y) subjects would guess randomly for that string. This two-parameter version of HTT should do better than the one-parameter version at predicting p(3Es). In terms of PMT, it is plausible that p is not constant across subjects. Dienes (1992) assumed that subjects did not differ in their p's, but if different subjects have different learning rates, different noisy encodings of a stimulus at any one learning episode, and different starting weights, the same architecture (such as any of those in Dienes, 1992) would give different p's for different subjects, before learning asymptotes. A distribution produced by subjects with different p's is characteristically different from a binomial by virtue of having greater variance. This is exactly what the data showed: more p(3Cs) and p(3Es) than predicted by a binomial; that is, a greater variance. Consider, for example, an overall p of .50 that resulted from some subjects having p's greater than .50 and some less than .50. Then p(3Cs) and p(3Es) could be .50 each if a given subject had a p of either 0 or 1.
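Both extensions can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the two-parameter HTT is taken exactly as described (X always correct, Y always incorrect, the remainder guessing at .5), and the PMT extension is represented as a simple average of binomial distributions over a list of subject-specific p's.

```python
def binomial3(p):
    """Probabilities of 0, 1, 2, 3 correct responses in three independent trials."""
    q = 1 - p
    return [q ** 3, 3 * p * q * q, 3 * p * p * q, p ** 3]


def mixed_pmt(subject_ps):
    """PMT with subject-specific p's: average the per-subject binomials."""
    dists = [binomial3(p) for p in subject_ps]
    return [sum(column) / len(dists) for column in zip(*dists)]


def htt_two_param(x, y):
    """Two-parameter HTT: x always correct, y always incorrect, the rest guess."""
    g = 1 - x - y
    return [y + g / 8, 3 * g / 8, 3 * g / 8, x + g / 8]


# Two equally sized groups of subjects with p = .5 and p = .9 (mean .7),
# compared with a single binomial at p = .7. The values are arbitrary.
mixture = mixed_pmt([0.5, 0.9])
single = binomial3(0.7)
```

In this example the mixture puts more probability on three errors and on three correct responses than the single binomial with the same mean, and `mixed_pmt([0.0, 1.0])` reproduces the limiting case mentioned above, with p(3Cs) and p(3Es) each equal to .50.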
Different models of artificial grammar learning make different predictions about the behaviour of the parameters in two-parameter PMT. According to the connectionist models of Dienes (1992), subjects may start out with different p's, perhaps very discrepant ones if subjects start with very discrepant differences in weights, but the p's should converge with training, and at asymptote the variance in the p's should go to zero. On the other hand, Druhan and Mathews (1989) proposed a model of artificial grammar learning based on a classifier system, in which different subjects end up with different knowledge bases even after asymptotic learning. A prediction of the connectionist models but not the classifier system model is that one-parameter PMT should fit the data better and better as learning proceeds, and ultimately fit it perfectly, even for those strings for which the p remains very low. Experiments 1 and 3 both showed an increasingly good fit with training. Future research could determine whether the improvement continues with more extensive training.
Even if two-parameter HTT could account for subjects' consistency, it still in principle could not predict one finding from the data: that p(2Cs, 1E) was generally greater than p(1C, 2Es) given that p > .60. Inconsistent responding can only come from subjects who purely guess, and random guessing predicts that p(2Cs, 1E) should equal p(1C, 2Es). Augmenting two-parameter HTT with assumptions about subjects' memory for the previous response still could not explain the difference between p(2Cs, 1E) and p(1C, 2Es). The problem might be with the assumption that when subjects do not know the answer they guess randomly; indeed, in the No Exposure Group of Experiment 1, when subjects had not learnt anything, they did not respond randomly. A further extension of HTT would be to take the behaviour of subjects in a no-training group as a model of subjects' guessing behaviour and then assume that the only influence of learning is to add correctly or incorrectly known items; i.e., the ratio of p(2Cs, 1E) to p(1C, 2Es) should not change with learning, only p(3Cs) and p(3Es) would be incremented. However, in the No Exposure Group of Experiment 1, there were 90 cases classified correctly just twice and 92 cases classified correctly just once; i.e., p(2Cs, 1E) was virtually identical to p(1C, 2Es), if anything numerically smaller. Thus, it seems that it is the process of learning that produces the superiority of p(2Cs, 1E) over p(1C, 2Es); this is incompatible with any version of HTT. The superiority of p(2Cs, 1E) over p(1C, 2Es) is of course consistent with two-parameter PMT.
In summary, this paper has explored the way in which subjects apply their implicit knowledge of artificial grammars in terms of models suggested by Reber (1967) and Dienes (1992), and indicated how these models could be extended in future research.
Appendix 1. Results for Experiment 1
NB: The critical value of χ² at the p < .05 level in each case is 7.81 (df=3).
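As an aid to reading the per-item fits, the following sketch shows how a chi-square value for one item could be computed from the four response categories and compared with the critical value of 7.81. The observed counts and the predicted probabilities in the example are invented for illustration and are not taken from the appendix tables; the function name is likewise hypothetical.

```python
CRITICAL_VALUE_DF3 = 7.81  # chi-square critical value at p < .05, df = 3


def item_chi_square(observed, predicted_probs):
    """Pearson chi-square comparing observed category counts with model predictions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, predicted_probs))


# Categories in order: 3Cs, (2Cs, 1E), (1C, 2Es), 3Es.
observed = [20, 12, 5, 3]                # made-up counts for 40 subjects
predicted = [0.512, 0.384, 0.096, 0.008]  # e.g. a binomial with p = .80
chi2 = item_chi_square(observed, predicted)
departs_from_model = chi2 > CRITICAL_VALUE_DF3
```

In this invented example most of the chi-square comes from the last category, where three subjects made three errors although the model expects fewer than one.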

Table 1
Stimuli used in Experiment 1
In the second section of this experiment you are to judge items that you have not yet seen. An item is 'grammatical' if it is based on the same structure as the ones you learned. Otherwise the item is 'not grammatical'. You have to decide whether an item is grammatical or ungrammatical and then you indicate on a percentage scale how confident you are in your decision. Choose 50% if you think it was a pure guess. If in your opinion your judgement goes beyond guessing choose a correspondingly higher percentage. Make the required indications with the 'mouse' by moving the arrowhead into the chosen area and then pushing the left mouse key."

Table 1.1 Results for the No Exposure Group

Table 3
Table 3.3 Results for the Triple Exposure Group