MEASURING EMPATHIC TENDENCIES: RELIABILITY AND VALIDITY OF THE DUTCH VERSION OF THE INTERPERSONAL REACTIVITY INDEX

Dr. Kim De Corte is now affiliated with the Ghent University Hospital, Department of Child and Adolescent Psychiatry; Dr. Ann Buysse, Dr. Lesley L. Verhofstadt, and Dr. Herbert Roeyers are with the Department of Experimental Clinical and Health Psychology, Ghent University; Dr. Koen Ponnet is now affiliated with the University of Antwerp, Research Centre for Longitudinal and Life Course Studies; Dr. Mark H. Davis is with the Department of Behavioural Science, Eckerd College, St. Petersburg, USA. Correspondence concerning this article should be addressed to Kim De Corte, Ghent University Hospital, Department of Child and Adolescent Psychiatry, De Pintelaan 185, 9000 Ghent, Belgium. E-mail: kim.decorte@uzgent.be

In recent years, the theoretical consensus about a multidimensional conception of empathy that comprises both cognitive and affective components has grown substantially (Kerem, Fishman, & Josselson, 2001; Thornton & Thornton, 1995). Over the years, various self-report measures of empathy have been developed. Currently, Davis' Interpersonal Reactivity Index (IRI; Davis, 1980) is the most widely and frequently used scale for measuring individual differences in empathic tendencies (Pulos, Elison, & Lennon, 2004). Its popularity is attributable to several desirable qualities. First, it is the only such scale based on a multidimensional conceptualisation of empathy. Second, the IRI is regarded as the most comprehensive measure of self-reported empathic dispositions. Finally, the scale is relatively short and thus simple to administer.
Based on a multidimensional approach to empathy, the IRI was designed to assess a set of empathic tendencies, related in that they all have to do with the dispositional tendencies to be responsive to others, but also clearly discriminable from each other. To assess these different empathic dispositions, four seven-item scales were created: (a) Perspective taking (PT), the tendency to adopt another's psychological perspective, (b) Fantasy (FS), the tendency to identify strongly with fictitious characters, (c) Empathic concern (EC), the tendency to experience feelings of warmth, sympathy, and concern toward others, and (d) Personal distress (PD), the tendency to have feelings of discomfort and concern when witnessing others' negative experiences (Davis, 1994).
The PD and EC scales assess affective components, whereas the PT scale represents the cognitive component. Although the FS scale, with its focus on identifying with fictional characters, is frequently included among the "affective" components of the IRI, we find it harder to place along the affective-cognitive dimension (see also Baron-Cohen & Wheelwright, 2004). The four IRI scales are combined in a single instrument because they represent separate facets of what is termed "empathy". Davis and colleagues (see Davis, 1980, 1983; Davis & Franzoi, 1991) supported the IRI's multidimensional conceptualisation of empathy by demonstrating that the four dimensions constitute unique but related aspects of empathy, and provided further evidence for this view by showing predicted significant relationships between the IRI scale scores and interpersonal functioning, social competence, and other empathy-related measures.
A number of studies have shown that the IRI provides a reliable and valid self-report measure of people's empathic tendencies (see Davis, 1994, for a review). However, although the IRI shows much promise, some validity issues still need further investigation. First, there is some uncertainty about the IRI's factor structure, as research on the structure has yielded different results. Further examination of the factor structure of the IRI is therefore desirable, in particular a theory-based validation of that structure. Second, it would also be advisable to investigate convergent and discriminant validity further. Third, although the IRI has been translated into several languages (e.g., Swedish, Spanish, French, Chinese, and German), a reliable and validated Dutch version of this instrument does not yet exist.
Thus, the purpose of the present paper is to describe the development of a Dutch version of the IRI and to evaluate the psychometric properties of its obtained scores. The first goal is to examine the hypothesised four-factor structure of the IRI scores, and to assess the internal reliability of the subscale scores. The second goal is to examine the construct validity evidence for scores of the new translation using a large Dutch sample. Third, we will examine evidence for the convergent and discriminant validity of its scores by examining associations with scores of other relevant measures.

Factor structure and scale reliability of the IRI
Evidence regarding the underlying structure of the IRI has been mixed. Some studies have found a stable four-factor structure consistent with the four IRI subscales (e.g., Litvack-Miller, McDougall, & Romney, 1997), whereas other studies have found alternative (and mutually inconsistent) factor solutions (e.g., Alterman, McDermott, Cacciola, & Rutherford, 2003; Cliffordson, 2002; Pulos et al., 2004; Siu & Shek, 2005). For example, some studies (e.g., Cliffordson, 2002) found a higher-order model with two global factors, whereas others found that a unidimensional structure best represented the IRI data (e.g., Alterman et al., 2003). Two explanations may account for this pattern. First, most of these studies applied an empirical model-testing procedure to find the best-fitting model, rather than testing a priori model structures specified on theoretical grounds. Although empirical model testing can be useful, the rationale for modifying the model is then determined post hoc, and such models cross-validate poorly (MacCallum, Roznowski, & Necowitz, 1992). Second, the use of different translations of the instrument may have contributed to the inconsistent findings (Brislin, 1988).
Bearing in mind that establishing the factor structure of a measure is essential to the credibility of empirical findings and theory development (Byrne, 1994), our first goal was to evaluate the factorial validity of the Dutch IRI using confirmatory factor analysis (CFA). Confirming Davis' four-factor structure in a different culture would help validate the score structure of the Dutch IRI.

Construct validity of the scores of the Dutch translation
An additional goal of this investigation was to examine the validity of the Dutch IRI by examining whether the subscale scores display relationships consistent with prior work using the original IRI.

Scale intercorrelations
One method of establishing construct validity is to examine the pattern of correlations among the four IRI scales. This intercorrelation pattern has been shown to be fairly consistent across prior studies (e.g., Carey, Fox, & Spraggins, 1988; Cliffordson, 2001; Pulos et al., 2004), so replicating it would provide additional evidence for the construct validity of the scale scores. Accordingly, the expected pattern of correlations between the scores of the Dutch IRI scales is as follows: a) EC scores will be significantly and positively associated with FS and PT scores, b) PD scores will be either negatively correlated with, or independent of, PT and EC scores, and c) FS scores will be independent of PT and PD scores.

Gender differences
In the literature, empathy is described as the object of a gendered belief (Shields, 1995), namely the assumption that women are more emotional and more caring than men (Zahn-Waxler, Cole, & Barrett, 1991). Consistent with this view, women frequently score significantly higher than men on self-reported empathy (Eisenberg & Miller, 1987; Hojat et al., 2002). More specifically, women generally score higher on all four IRI scales (e.g., Davis, 1980). Accordingly, one method for evaluating construct validity is to examine gender differences in the scale scores of the Dutch IRI.

Convergent and discriminant validity of the Dutch IRI
Another approach to establishing the validity of the IRI scale scores is to examine their relationships with scores on other, related scales or instruments. This paper considers the relationships between the four IRI scale scores and the scores of seven potentially associated constructs: emotional intelligence (EQ), the Big Five personality traits, Machiavellianism, self-esteem, and three intellectual ability indices. Each of these constructs, with the exception of the intellectual ability indices, is expected, on theoretical, logical, or empirical grounds, to be related to one or more of the IRI scale scores.

EQ
Mayer and colleagues (e.g., Mayer, Caruso, & Salovey, 1999) have shown that EQ, defined as the ability to be aware of, express, assimilate, understand, and manage one's emotions, is positively correlated with empathy indicators. However, the correlations between the IRI scale scores and EQ have not yet been investigated. To assess EQ, we used the Emotional Quotient Inventory (EQ-i; Bar-On, 1997), a self-report measure that taps the Intrapersonal, Interpersonal, Adaptability, Stress Management, and General Mood components of EQ. The Interpersonal component (the ability to be aware of and understand another's feelings; Bar-On, 1997) is expected to be positively related to the PT and EC scales, as these scales deal with one's tendency to imagine others' perspectives and to experience other-oriented emotions. The other components of the EQ-i focus more on emotional processes occurring within the individual (see Bar-On, 1997). Therefore, we expect the Intrapersonal, Stress Management, Adaptability, and General Mood components, all of which reflect the successful regulation of emotion, to be negatively correlated with PD scores and independent of PT and EC scores. How FS scores will relate to the different EQ-i scale scores is unclear.

Personality traits
One way to conceptualise and operationalise personality is in terms of five basic factors, labelled the 'Big Five': Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness (Costa & McCrae, 1992b). The NEO-Five Factor Inventory (NEO-FFI; Costa & McCrae, 1992a) is one measure developed to operationalise the Big Five model. Empathy is expected to correlate with various 'Big Five' traits (Del Barrio, Aluja, & García, 2004), but again, the associations between the IRI scale scores and personality trait scores have received little attention. Higher Neuroticism scores are expected to be associated with higher PD scores (Shiner & Caspi, 2003). Higher scores on Agreeableness and Extraversion, the primary dimensions of interpersonal behaviour (Costa, McCrae, & Dye, 1991), are expected to be positively correlated with EC and PT scores. Openness scores are expected to be associated with FS, PT, and EC scores, since Openness shows significant positive correlations with pro-social behaviours (Kosek, 1995). No relationships are expected between Conscientiousness scores and the IRI scale scores, because there is no apparent reason why a tendency to be well organised, self-disciplined, and dutiful should be related to one's empathic tendencies.

Machiavellianism
Machiavellianism (Mach) is regarded as a cluster of traits characterised by distrust, cynicism, selfishness, and a tendency toward interpersonal manipulation (McHoskey, Worzel, & Szyarto, 1998). Relative to people with low Mach scores, high Machs lack interpersonal warmth as well as the ability to identify emotions; they may therefore have a diminished capacity for empathy (Wastell & Booth, 2003). Several studies provide evidence for this assumption (e.g., Valentine, Fleischman, & Godkin, 2003). In line with previous research, we expect PT and EC scores to be negatively related to Mach scores, given that both scales address the tendency to concern oneself with another's state of mind. Neither PD nor FS scores are expected to show a significant relation with Mach scores.

Self-esteem
Self-esteem is defined as a global orientation characterised by self-oriented positive emotionality (Robins, Tracy, Trzesniewski, Potter, & Gosling, 2001). In view of this, self-esteem seems likely to be most strongly (and negatively) related to PD scores, as personal distress is a negative emotional reaction to another's distress (Batson, 1991). Because engaging in pro-social behaviour may be important in the development of feelings of self-worth (Laible, Carlo, & Roesch, 2004), the tendency to adopt another's perspective may also be positively related to self-esteem. These theoretical assumptions are in accordance with Davis' (1983) empirical findings. Consequently, the following predictions can be made: 1) PT scores will be positively associated with self-esteem, 2) PD scores will be negatively related to self-esteem, and 3) no relations are expected between self-esteem and either FS or EC scores.

Intellectual ability indices
Previous research examining the association between the original IRI scales and measures of intelligence has found little consistent association (e.g., Davis, 1983; Mayer & Geher, 1996). In line with this, we expect to find no consistent pattern of relationships between intellectual ability, measured by means of an IQ test, and the IRI scale scores.

Participants and procedure
Data were drawn from eight studies conducted with 651 Belgian participants. The participants were solicited using two methods. Advertisements were placed in magazines recruiting individuals who were willing to participate in a research project on empathy (13% of participants). In addition, a snowball sampling procedure was used to obtain the remaining participants (87%). First, a team of research assistants recruited individuals in their personal social networks. In a second step, additional participants were recruited through this initial sample. Those who responded positively to either recruitment method were given a standard description of the research (e.g., aims and procedure). The sample consisted of 299 men (46%) and 352 women (54%). The mean age was 24.48 years (SD = 4.79) for the men and 27.37 years (SD = 5.42) for the women. Seventy-three percent of the participants were unmarried, 20% cohabiting, and 7% married. After providing informed consent, all participants completed a package of questionnaires in a quiet room as part of a wider testing session.

Materials
The composition of the questionnaire package varied across the eight studies. All participants completed a self-reported empathy questionnaire (N = 651), whereas the other measures were completed by subsets of participants: EQ (n = 310), personality traits (n = 235), Machiavellianism (n = 182), and self-esteem (n = 221). In two small studies (n1 = 37, n2 = 36), an intelligence test was administered, from which participants' Total IQ, Verbal IQ, and Performance IQ were calculated.

Empathy
Empathic tendencies were assessed using the Dutch version of the IRI. The English version of the IRI was previously translated into Dutch/Flemish by the fourth author (see Roeyers, Buysse, Ponnet, & De Corte, under revision). To achieve semantic equivalence with the original IRI, the translation followed the standardised back-translation procedure (Bontempo, 1993). (The items of the final Dutch version of the IRI appear in Appendix A.) The IRI consists of 28 items. Participants indicate the extent to which each item describes them on a 5-point Likert scale ranging from 0 (does not describe me well) to 4 (describes me very well). The PT, FS, EC, and PD scale scores were computed by summing the scores on each scale's seven items, so that every subscale has the same minimum (0) and maximum (28) score. (We refer the reader to the Results section for the internal consistency reliability of the IRI scale scores.)
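The summing rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' scoring script: the FS item numbers (1, 5, 7, 12, 16, 23, and 26) are the FS items whose error covariances are discussed in the factor-structure results, whereas the PT, EC, and PD item lists are assumed from the conventional numbering of the English IRI and should be checked against Appendix A. The sketch also assumes that any reverse-keyed items have already been recoded.

```python
from typing import Dict, List, Sequence

# Item numbers (1-based) per subscale. FS matches the items named in the
# factor-structure results; PT, EC, and PD are an ASSUMPTION based on the
# usual English-IRI numbering (check against Appendix A).
IRI_SCALES: Dict[str, List[int]] = {
    "PT": [3, 8, 11, 15, 21, 25, 28],
    "FS": [1, 5, 7, 12, 16, 23, 26],
    "EC": [2, 4, 9, 14, 18, 20, 22],
    "PD": [6, 10, 13, 17, 19, 24, 27],
}


def score_iri(responses: Sequence[int],
              scales: Dict[str, List[int]] = IRI_SCALES) -> Dict[str, int]:
    """Sum the seven 0-4 responses of each subscale (range 0-28).

    `responses` holds the 28 answers in item order. Reverse-keyed items are
    assumed to have been recoded beforehand (hypothetical preprocessing).
    """
    if len(responses) != 28:
        raise ValueError("expected 28 item responses")
    if any(not 0 <= r <= 4 for r in responses):
        raise ValueError("responses must lie on the 0-4 scale")
    return {name: sum(responses[i - 1] for i in items)
            for name, items in scales.items()}
```

For example, a participant who answers 2 to every item receives a score of 14 on each subscale.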

EQ
Bar-On's Emotional Quotient Inventory (EQ-i; Bar-On, 1997) comprises 133 items, scored on a 5-point scale ranging from 1 (very seldom or not true of me) to 5 (very often true of me or true of me). This self-report measure assesses the trait indicators of EQ and provides a Total EQ score and five composite scale scores, representing the Intrapersonal, Interpersonal, Adaptability, Stress Management, and General Mood components of EQ. Raw scale scores are transformed into standard scores. The Dutch version of the EQ-i was used (Derksen, 1998). Participants were excluded if any of the four validity indices suggested that their results were invalid (see Bar-On, 1997); on this basis, 17 participants were excluded from further analyses involving emotional intelligence. The EQ-i produced an overall Cronbach's alpha coefficient of .93. The reliability coefficients for the composite scales were .91 for the Intrapersonal, .77 for the Interpersonal, .79 for the Adaptability, .81 for the Stress Management, and .84 for the General Mood component.

Personality traits
The Dutch version (Hoekstra et al., 1996) of the NEO-Five Factor Inventory (NEO-FFI), a short version of the NEO-PI-R Personality Inventory (Costa & McCrae, 1992a), was administered to assess the Big Five personality traits: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Participants were presented with 60 items, 12 for each domain, and were asked to indicate the extent to which they agreed or disagreed with each statement on a 5-point Likert scale. In this study, the Cronbach's alpha coefficients for the five personality domains were .78 for Extraversion, .69 for Agreeableness, .82 for Conscientiousness, .84 for Neuroticism, and .71 for Openness.

Machiavellianism
The Dutch version of the Mach-IV (Christie & Geis, 1970; Dutch version by Van Kenhove et al., 2001) is a 20-item inventory that measures the use of interpersonal manipulation strategies and agreement with Machiavellian statements. Items are scored on a 7-point Likert scale from 7 (strongly agree) through 4 (no opinion) to 1 (strongly disagree). Cronbach's alpha in this study was .66.

Self-esteem
The Dutch version of the Rosenberg Self-esteem Scale (Rosenberg, 1965) assesses a person's feelings of self-acceptance and self-worth. The statements of this 10-item scale are rated on a 4-point Likert scale ranging from 0 (strongly agree) to 3 (strongly disagree). Cronbach's alpha in this study was .76.

Intellectual ability indices
General intelligence was measured by means of the Dutch version of the Wechsler Adult Intelligence Scale-Third Edition (WAIS-III; Wechsler, 2000). For each participant, a Total IQ score, a Verbal IQ score, and a Performance IQ score were calculated.

Statistical analyses
CFA was conducted using LISREL for Windows, version 8.50 (Jöreskog & Sörbom, 2001), to examine the factor structure of the IRI scores. Following recommendations in the CFA literature, goodness of fit was evaluated on the basis of several fit indices (Hu & Bentler, 1999). With a large sample, as in the present study, the χ² test statistic will almost certainly be significant even for well-fitting models (Gerbing & Anderson, 1993)¹; therefore, the χ²/df ratio is also reported. A χ²/df ratio between 2 and 5 is taken to indicate acceptable fit (Marsh & Hocevar, 1985), and values below 3 are considered favourable in large-sample analyses (Kline, 1998). In addition, we examined several indices that are less sensitive to sample size (Marsh & Balla, 1994): (1) the comparative fit index (CFI), (2) the goodness-of-fit index (GFI), (3) the adjusted goodness-of-fit index (AGFI), (4) the root mean square error of approximation (RMSEA), and (5) the standardised root mean square residual (SRMR). The GFI and AGFI are absolute fit indices (the AGFI adjusts the GFI for model degrees of freedom), whereas the CFI is an incremental fit index (Jöreskog & Sörbom, 2001). For these three indices, values greater than 0.90 indicate an acceptable fit. The RMSEA, a non-centrality-based index, is highly recommended for evaluating model fit: a value of about 0.05 or less indicates a close fit of the model, and a value of about 0.08 indicates a reasonable fit. The 90% confidence interval (CI) around the RMSEA point estimate should contain 0.05 to indicate the possibility of close model-data fit (Browne & Cudeck, 1993). The fifth indicator, the SRMR, is a standardised summary of the average covariance residuals (Kline, 1998); an SRMR below 0.08 indicates a relatively good fit (Hu & Bentler, 1999).
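The fit criteria listed above can be collected into a small checking function. The Python sketch below is illustrative only and is not part of the LISREL analysis: the RMSEA formula shown is one common definition (programs differ in whether they divide by N - 1 or N), and a value of exactly 0.90 is treated as acceptable, as the Results section does for the GFI, even though the text above says "greater than 0.90".

```python
import math
from typing import Dict


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a model chi-square.

    Uses sqrt(max(chi2 - df, 0) / (df * (n - 1))); some programs use n
    instead of n - 1, so small discrepancies with software output are possible.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def check_fit(chi2_df: float, cfi: float, gfi: float, agfi: float,
              rmsea_value: float, srmr: float) -> Dict[str, bool]:
    """Apply the cut-offs described in the text (0.90 counted as acceptable)."""
    return {
        "chi2/df < 3": chi2_df < 3,
        "CFI >= .90": cfi >= 0.90,
        "GFI >= .90": gfi >= 0.90,
        "AGFI >= .90": agfi >= 0.90,
        "RMSEA <= .08": rmsea_value <= 0.08,
        "SRMR < .08": srmr < 0.08,
    }
```

Applied to the values reported for Davis' four-factor model (χ²/df = 2.93, CFI = 0.86, GFI = 0.90, AGFI = 0.87, RMSEA = 0.06, SRMR = 0.06), the function flags only the CFI and AGFI as falling below their cut-offs.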

Distributional properties of the IRI items
Prior to further analyses, skewness and kurtosis statistics were used to inspect the distribution of responses to the IRI items. All items and factors displayed skewness and kurtosis statistics within an acceptable range (Byrne, 1998). The percentage of missing values was negligible (0.31%), and the missing values were spread across the items; they were therefore replaced with the mean of the relevant variable (Gold & Bentler, 2000). All variables were included in the analyses, because the descriptive statistics showed that all items approximated a normal distribution (Muthén & Kaplan, 1992).
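The two preprocessing steps described above, distribution checks and mean substitution for the few missing responses, might look like this in Python. The text does not state which skewness and kurtosis estimators or cut-offs were used, so this sketch assumes simple moment-based estimators and leaves the judgement of an acceptable range to the reader; missing responses are assumed to be coded as None.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def mean_impute(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing responses (None) with the mean of the observed ones."""
    observed = [x for x in column if x is not None]
    m = mean(observed)
    return [m if x is None else x for x in column]


def skew_kurtosis(values: Sequence[float]) -> Tuple[float, float]:
    """Moment-based skewness (g1) and excess kurtosis (g2).

    These are the simple population-moment estimators; statistics packages
    often apply small-sample corrections and report slightly different values.
    """
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0
```

Mean substitution is used here only because the text reports it for a negligible proportion of missing values; with more missing data, model-based approaches would usually be preferred.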

Factor structure of the IRI scores
We first attempted to replicate the four-factor structure identified by Davis (1980) using CFA, estimating the four-factor model² with the iterated maximum likelihood procedure. The observed variance/covariance matrix was used as input for all analyses. Factors were allowed to correlate (analogous to an oblique rotation). Each IRI item was allowed to load freely on its hypothesised factor but not on the other factors, and measurement errors were not allowed to covary. The fit indices were: χ²/df = 2.93, CFI = 0.86, GFI = 0.90, AGFI = 0.87, RMSEA = 0.06 (90% CI = 0.05-0.06), SRMR = 0.06, AIC = 1219.06. Some indices indicated an acceptable to reasonably good fit (χ²/df < 3, RMSEA = 0.06, SRMR = 0.06), and the GFI reached the 0.90 criterion; however, the CFI (0.86) and AGFI (0.87) fell short of it. Thus, although Davis' four-factor model provided a reasonable fit to the data, there was room for improvement.
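Because the χ²/df ratio depends on the model's degrees of freedom, it may help to see how the df follow from the specification above. The Python sketch below counts them under standard identification assumptions (factor variances fixed to 1, each item loading on one factor, freely correlated factors); the resulting values are derived here and are not reported in the text.

```python
def cfa_df(n_items: int, n_factors: int, n_error_covs: int = 0) -> int:
    """Degrees of freedom of a simple-structure CFA with correlated factors.

    Assumes each item loads on exactly one factor, factor variances are fixed
    to 1 for identification, factors correlate freely, and measurement errors
    are uncorrelated apart from `n_error_covs` freed error covariances.
    """
    moments = n_items * (n_items + 1) // 2          # unique variances/covariances
    loadings = n_items                               # one loading per item
    error_vars = n_items                             # one error variance per item
    factor_corrs = n_factors * (n_factors - 1) // 2  # freely correlated factors
    free_params = loadings + error_vars + factor_corrs + n_error_covs
    return moments - free_params
```

Under these assumptions, `cfa_df(28, 4)` gives 406 - 62 = 344 for Davis' model, and each freed error covariance costs one further degree of freedom.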
An inspection of the modification indices suggested that the model could be improved substantially if the error covariances among the items making up the FS scale were estimated freely. These modification indices suggest that the FS items share variance beyond that accounted for by their common factor, plausibly because of an unusually high degree of semantic overlap among them. Why might this be? One possibility has to do with the process by which the FS scale was initially created. The starting point was a set of three items from Stotland's Fantasy-Empathy scale (FES; Stotland, 1969); new items were then written to match their content (see Davis, 1980, 1983). The three original FES items all focus on transposing oneself into fictitious works (e.g., books, movies), and the four additional items largely reflect the same content. Given this strong semantic overlap among the seven Fantasy items, freeing the error covariances between them is defensible (Y. Rosseel, personal communication, April 2, 2005).
² To investigate whether the IRI items measure general empathy, a hierarchical model could be tested in which the four latent factors, PT, FS, EC, and PD, load freely on one second-order latent construct, here, general empathy. However, with four first-order factors, such a model reproduces the six factor correlations from only four second-order loadings; it is therefore nested within the four-factor model with freely correlated factors (two degrees of freedom more restrictive) and cannot fit the data better (cf. Bollen, 1989). An examination of the correlations between the four latent factors should therefore sufficiently inform us about their shared variance.
Note. PT = Perspective taking; FS = Fantasy; EC = Empathic concern; PD = Personal distress.
Thus, on the basis of both theoretical arguments and the modification indices, seven modifications were made to the original model (see Figure 1): seven error covariances among the FS items were freely estimated, namely those between IRI items 7 and 12, 16 and 23, 5 and 12, 7 and 26, 12 and 16, 1 and 26, and 12 and 26. Importantly, this strategy did not require adding or deleting any paths between the observed and latent variables. The fit indices for this modified model were: χ²/df = 2.47, CFI = 0.90, GFI = 0.91, AGFI = 0.90, RMSEA = 0.05 (90% CI = 0.04-0.05), SRMR = 0.06, AIC = 1014.74. These relatively minor modifications noticeably improved the fit indices.
We used Akaike's (1973) Information Criterion (AIC) to compare the competing models; the model with the lowest AIC is preferred (Bozdogan, 2000). The AIC favours the modified four-factor model (AIC = 1014.74) over Davis' four-factor model (AIC = 1219.06). Since the fit indices indicated that the modified four-factor model³ offered a statistically more adequate account of the data than Davis' four-factor model, the standardised factor loadings of each item in this modified model were examined. Table 1 displays the loadings for the modified four-factor solution. As can be seen, all factor loadings are significant and above .32.
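The AIC comparison above amounts to picking the smallest value. The Python sketch below shows one common SEM form of the criterion (χ² plus twice the number of free parameters) and applies the selection rule to the two AIC values reported above. The exact chi-square and constant LISREL uses are not stated in the text, so the `aic` helper is an assumption and is not meant to reproduce the reported numbers.

```python
from typing import Dict


def aic(chi2: float, n_free_params: int) -> float:
    """One common SEM form of Akaike's criterion: chi-square + 2q.

    Programs differ in the exact chi-square and constant they use, so AIC
    values are only comparable when computed the same way for every model.
    """
    return chi2 + 2 * n_free_params


def preferred_model(aics: Dict[str, float]) -> str:
    """Return the name of the model with the lowest AIC."""
    return min(aics, key=aics.get)


reported = {"Davis four-factor": 1219.06, "Modified four-factor": 1014.74}
best = preferred_model(reported)  # -> "Modified four-factor"
```

Because the AIC penalises each additional free parameter, the seven freed error covariances must buy more than 14 points of χ² reduction for the modified model to be preferred, which the reported values clearly satisfy.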

Internal reliability of the IRI scale scores
Cronbach's alpha coefficients were calculated for the scores on each of the four IRI scales. As presented in Table 2, the results indicate that the four IRI scales have satisfactory internal consistency in this Dutch sample.
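For reference, Cronbach's alpha for one scale can be computed from the item scores as α = k/(k - 1) × (1 - Σσ²ᵢ/σ²ₜ), where k is the number of items, σ²ᵢ the variance of item i, and σ²ₜ the variance of the total score. The Python sketch below is a minimal implementation of that formula, not the software used in the study; it expects one list of scores per item, with participants in the same order in every list.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a list of item columns (one list per item).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    Sample variances are used throughout; the ratio is the same either way.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # one total per participant
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Identical items yield α = 1, and items that are uncorrelated yield α = 0, which makes the function easy to sanity-check.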
³ In addition, we examined the measurement invariance of the scores on the Dutch version of the IRI across two independent groups by means of the full measurement invariance test (Kline, 1998). To this end, the sample (N = 651) was randomly split into two independent sub-samples (n = 325 in each sub-sample). Both the modified four-factor model without equality constraints (i.e., the unconstrained model) and the very restrictive four-factor model (i.e., the constrained model, equating the factor loadings, factor correlations, and error variances) fitted the data adequately in both sub-samples. Moreover, the change in overall χ² between the unconstrained and the constrained model was not statistically significant, Δχ²(69) = 4.11, p > .05, indicating that the factor loadings, factor correlations, and error variances were invariant across both independent sub-samples (Vandenberg & Lance, 2000). In sum, the modified four-factor model was invariant across the two independent sub-samples.
(1) Factors were allowed to correlate and that (2) each item was allowed to load freely on its hypothesised factor but not allowed to load on other factors. All standardized factor loadings are significant at p < .001. IRI = Interpersonal Reactivity Index; PT = Perspective taking factor; FS = Fantasy factor; EC = Empathic concern factor; PD = Personal distress factor.
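The chi-square difference test in footnote 3 compares nested models by referring Δχ² to a chi-square distribution with Δdf degrees of freedom. Python's standard library has no chi-square distribution, so the sketch below implements the upper-tail probability with the closed-form finite series that exists for integer degrees of freedom; in practice a statistics package would be used instead. The helper `chi2_difference_test` is a hypothetical convenience function, not part of any library.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df.

    Uses the closed-form finite series for integer df; adequate for the
    moderate values of x and df that occur in model comparisons.
    """
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal density at chi
    if df % 2 == 1:
        tail = math.erfc(chi / math.sqrt(2))          # equals 2 * P(Z > chi)
        term, total = chi, 0.0
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)               # chi^(2r-1) / (1*3*...*(2r-1))
            total += term
        return tail + 2 * pdf * total
    term, total = 1.0, 1.0
    for r in range(1, df // 2):
        term *= x / (2 * r)                           # chi^(2r) / (2*4*...*(2r))
        total += term
    return math.sqrt(2 * math.pi) * pdf * total


def chi2_difference_test(chi2_constrained: float, df_constrained: int,
                         chi2_free: float, df_free: int):
    """Nested-model comparison: returns (delta_chi2, delta_df, p)."""
    d_chi2 = chi2_constrained - chi2_free
    d_df = df_constrained - df_free
    return d_chi2, d_df, chi2_sf(d_chi2, d_df)
```

For instance, `chi2_sf(4.11, 69)` is essentially 1, consistent with the non-significant Δχ²(69) = 4.11 reported in footnote 3.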

Scale intercorrelations
Table 2 presents the correlations among the IRI scale scores, which range from -.09 to .37. As expected, EC scores were significantly and positively related to PT and FS scores. The correlation between PT and PD scores was weak; given the size of the sample, it is significant, although small in size (see Cohen, 1992). Other substantial and significant positive correlations were found between PD and EC scale scores, on the one hand, and between FS scores and both PD and PT scores, on the other.
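The point that a weak correlation can be significant in a sample of this size can be made concrete. The Python sketch below converts r to the usual t statistic, t = r√(n - 2)/√(1 - r²), and, as a simplifying assumption, uses the normal distribution instead of the t distribution (a close approximation when n is in the hundreds). It also solves that equation for the smallest |r| that reaches two-sided significance.

```python
import math
from statistics import NormalDist
from typing import Tuple


def r_significance(r: float, n: int) -> Tuple[float, float]:
    """t statistic for H0: rho = 0 and a normal-approximation two-sided p-value."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_two_sided


def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| reaching two-sided significance (normal approximation).

    Solving r * sqrt(n - 2) / sqrt(1 - r^2) = z for r gives
    r = z / sqrt(n - 2 + z^2).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / math.sqrt(n - 2 + z * z)
```

With N = 651, `critical_r(651)` is about .077, so even correlations below .10 reach p < .05, which is why the text stresses effect size rather than significance.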

Gender differences
To assess gender differences in scores on the four IRI scales while controlling for multiple comparisons, we conducted a multivariate analysis of variance (MANOVA) with gender as the independent variable and the four IRI scale scores as the dependent variables. The analysis revealed a significant main effect of gender, Wilks's lambda = 0.77, F(4, 646) = 47.44, p < .001, η² = .23. The effect of gender was significant for all four scales, with women scoring higher than men on each one (see Table 3). The effect sizes for the FS, EC, and PD scales were in the range that Cohen (1988) describes as "large", and a medium effect size (approximately 0.50 standard deviation units) was found for the PT scale.
Note. IRI = Interpersonal Reactivity Index; PT = Perspective taking; FS = Fantasy; EC = Empathic concern; PD = Personal distress. a The effect size measure used is Cohen's d (Cohen, 1988). b Df = (1, 649). * p < .001.

Convergent and discriminant validity
Table 4 displays the Pearson correlation coefficients between the four scale scores of the Dutch IRI and the scores on a variety of other instruments. Because statistical significance depends heavily on sample size, effect size provides a more informative index of the relations between study variables. Effect sizes were estimated from the magnitude of the correlation coefficients using Cohen's (1992) criteria: values below 0.1 are regarded as insubstantial, values from 0.1 to 0.3 as small, values from 0.3 to 0.5 as moderate, and values above 0.5 as large. In every case, the effect size could be described as small or moderate. Because sample sizes differed greatly across the instruments, differentiating between small and moderate effects was not considered justified.
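The effect-size statistics in this section can be computed from their ingredients with the Python sketch below, which is illustrative and not the authors' analysis code. For a two-group MANOVA, Wilks's lambda converts exactly to F = ((1 - Λ)/Λ) × (N - p - 1)/p on (p, N - p - 1) degrees of freedom, where p is the number of dependent variables, and multivariate η² = 1 - Λ. Cohen's d uses the pooled standard deviation, and the correlation labels follow the Cohen (1992) cut-offs quoted above, with a value of exactly .30 assumed to count as moderate.

```python
import math
from typing import Sequence, Tuple


def wilks_two_group_F(wilks: float, n_dvs: int,
                      n_total: int) -> Tuple[float, Tuple[int, int], float]:
    """Exact F, its df, and multivariate eta squared for a two-group effect."""
    df1, df2 = n_dvs, n_total - n_dvs - 1
    f = (1 - wilks) / wilks * df2 / df1
    return f, (df1, df2), 1 - wilks


def cohens_d(group1: Sequence[float], group2: Sequence[float]) -> float:
    """Standardised mean difference using the pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled


def label_r(r: float) -> str:
    """Cohen's (1992) size labels for a correlation (.30 counted as moderate)."""
    a = abs(r)
    if a < 0.1:
        return "insubstantial"
    if a < 0.3:
        return "small"
    if a <= 0.5:
        return "moderate"
    return "large"
```

With Λ = 0.77, p = 4 IRI scales, and N = 651, `wilks_two_group_F` gives η² = .23 and F ≈ 48.2 on (4, 646) degrees of freedom; the small gap to the reported F of 47.44 is what one would expect if Λ was rounded to two decimals.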

EQ
Correlations between IRI scores and scores on the EQ measure were largely as expected. Scores on the PT scale were moderately and positively associated with Total EQ scores and, as expected, were most strongly related to the Interpersonal dimension of EQ. Higher PT scores were also associated with being better able to cope with stress and being more flexible in social settings. EC scores were also positively associated with higher Interpersonal EQ scores, again consistent with expectations, but not with Total EQ. PD scores were negatively associated with Total EQ; this pattern resulted not from poorer interpersonal abilities but from the predicted lower scores on the Intrapersonal dimension, and higher PD scores were also associated with lower tolerance for stress and lower levels of optimism. Finally, FS scores were unrelated to most of the EQ domains, although higher FS scores were associated with higher Interpersonal EQ scores.
Note. IRI = Interpersonal Reactivity Index; PT = Perspective taking; FS = Fantasy; EC = Empathic concern; PD = Personal distress. a Derksen (1998); b Hoekstra et al. (1996); c Van Kenhove et al. (2001); d Rosenberg Self-esteem Scale (1965); e Wechsler (2000). * p < .01, ** p < .001.

Personality traits
With regard to the Big Five traits, results were again largely in accord with predictions. PT scores were associated with being open-minded and agreeable. EC scores were moderately and positively associated with Agreeableness scores. PD scores were moderately and positively associated with Neuroticism. FS scores were moderately and positively associated with greater open-mindedness. Conscientiousness and Extraversion did not display consistent relationships with any of the IRI scale scores.

Machiavellianism
Regarding Machiavellianism (Mach), results were only partially in accord with predictions. Consistent with expectations, scores on the EC scale were negatively associated with Mach, and scores on the FS and PD scales were unrelated to Mach. Contrary to expectations, however, the PT scale score was not significantly related to Mach scores.

Self-esteem
Consistent with predictions, scores on the PD scale were negatively associated with self-esteem, and neither the EC nor the FS scale score was related to self-esteem. Unexpectedly, however, higher scores on the PT scale were not significantly associated with higher self-esteem.

Intellectual ability indices
With regard to the intellectual ability indices, results were largely in accord with predictions. With the exception of PD, none of the IRI scales was related to any intellectual ability measure. Contrary to expectations, the PD scale score was negatively associated with Total IQ and Verbal IQ.

Discussion
The current study examined the psychometric properties of the scores on the Dutch version of the IRI. Almost without exception, the results supported the psychometric adequacy of these scores in terms of factor structure and scale reliability, construct validity as reflected in scale intercorrelations and gender differences, and convergent and discriminant validity as evidenced by correlations with related measures. Thus, the Dutch version of the IRI appears to be a useful complement to the original instrument.

Factor structure
Using CFA, we examined the factor structure of the IRI. The first aim was to determine whether Davis' four-factor model, which is based on both empirical and theoretical considerations, represented the score structure of the Dutch IRI. Goodness-of-fit indices suggested that the fit between the four-factor model and the data was acceptable but not excellent. To improve model fit, we made post hoc adjustments to Davis' four-factor structure by allowing within-factor correlated measurement errors for some of the items making up the FS scale. CFA revealed that this modified four-factor model provided a better fit to the data. The key question, then, is what explains the need to relax certain constraints in order to achieve an adequate fit for the four-factor model.
Three possible reasons can be advanced to explain the within-factor correlated measurement errors in the IRI scores (Netemeyer, 2001). First, there may be semantic overlap that produces covariation among the FS items above and beyond any covariation between the concepts that the FS items tap. This would indicate that the unidimensional measurement of the FS factor is threatened, because an additional source of shared variance exists within this factor. It should be noted in this regard that, unlike the items on the other three IRI scales (all of which were written expressly for the IRI), the items making up the FS scale came from two separate sources: as mentioned previously, four FS items were written for the IRI, but the other three were taken from Stotland's (1969) FES. Perhaps the different origins of these two sets of items help account for this pattern. It is also possible that idiosyncrasies in translating the FS scale into Dutch created additional semantic overlap within this scale. It would be informative to conduct similar analyses on scores from other translations of the IRI to see whether within-factor correlated error variances are also needed for the FS scale in other languages. A second possibility is that there are unwanted or unexplained sources of covariation beyond the four factors specified a priori in the measurement model of the IRI; in other words, the covariation among the FS items may not be adequately accounted for by the four factors of Davis' structure. Finally, these correlated errors may be idiosyncratic to this sample and may not replicate in other samples.
However, given that these correlated error variances were obtained from a relatively large sample (Cote, 2001) and were homogeneous across two random subsamples (see Footnote 3), it seems less likely that this pattern is simply an anomaly.
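What "allowing a within-factor correlated measurement error" changes can be seen in the covariance matrix a CFA model implies, Σ = ΛΦΛᵀ + Θ. In Davis' original specification Θ is diagonal, so two FS items can covary only through their shared factor (λᵢλⱼφ). Freeing an off-diagonal θᵢⱼ lets the model reproduce extra covariance between those two items, which is how semantic overlap or an unmodelled shared source would surface. This is a minimal sketch with one factor and three items; the loadings and the freed error covariance of 0.10 are hypothetical values, not estimates from the study, and the code only computes the implied matrix (it does not fit a model).

```python
def implied_covariance(loadings, factor_variance, theta):
    """Model-implied covariance matrix for a single-factor CFA.

    sigma[i][j] = loadings[i] * factor_variance * loadings[j] + theta[i][j],
    where theta holds error variances on the diagonal and any freed
    (correlated) error covariances off the diagonal.
    """
    k = len(loadings)
    return [
        [loadings[i] * factor_variance * loadings[j] + theta[i][j] for j in range(k)]
        for i in range(k)
    ]


def diagonal_theta(error_variances):
    # Strict model: measurement errors uncorrelated (off-diagonal terms fixed at 0).
    k = len(error_variances)
    return [[error_variances[i] if i == j else 0.0 for j in range(k)] for i in range(k)]


# Hypothetical standardized loadings for three items on one factor.
loadings = [0.7, 0.6, 0.5]
error_vars = [1.0 - lam ** 2 for lam in loadings]

strict = implied_covariance(loadings, 1.0, diagonal_theta(error_vars))

# Relaxed model: free the error covariance between items 0 and 1 only.
relaxed_theta = diagonal_theta(error_vars)
relaxed_theta[0][1] = relaxed_theta[1][0] = 0.10
relaxed = implied_covariance(loadings, 1.0, relaxed_theta)

print(round(strict[0][1], 2), round(relaxed[0][1], 2))
```

Only the covariance of the pair whose errors were freed changes (from 0.42 to 0.52 here); every other element of the implied matrix is identical under both models.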
The FS scale is also one for which a reasonably strong case can be made for eliminating an item. IRI item 1 emerged as a relatively weak indicator of the FS factor in terms of factor loading, a finding consistent with results reported by Davis (1980). Content analysis of the FS items further revealed that item 1 does not reflect the tendency to empathise with another person, whereas the remaining six FS items all assess the tendency to imagine oneself in another person's position. This may explain why item 1 is a relatively weak contributor to the FS factor. This theoretical rationale allows us to consider eliminating the item, provided that doing so does not appreciably reduce reliability, since its removal might increase the overall semantic coherence of the FS scale (Frary, 2000; see Footnote 4). Whether this also holds for the English IRI and other translations remains to be seen.
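The condition that removing item 1 should not appreciably reduce reliability is usually checked with an alpha-if-item-deleted analysis. The sketch below assumes Cronbach's alpha as the internal consistency index and uses made-up toy data (five respondents, three items), not the study's FS responses. Population and sample variances give the same alpha because the n versus n − 1 factor cancels in the ratio.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of k item-score lists (one score per respondent)."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n_respondents = len(items[0])
    totals = [sum(item[r] for item in items) for r in range(n_respondents)]
    sum_item_var = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))


def alpha_if_item_deleted(items):
    # Alpha recomputed with each item left out in turn (needs at least three items).
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


# Toy data: items A and B covary perfectly; item C is uncorrelated with both.
item_a = [1, 2, 3, 4, 5]
item_b = [1, 2, 3, 4, 5]
item_c = [4, 1, 3, 5, 2]
scale = [item_a, item_b, item_c]

print(round(cronbach_alpha(scale), 2))
print([round(a, 2) for a in alpha_if_item_deleted(scale)])
```

In this toy case, dropping the unrelated item C raises alpha (from 0.6 to 1.0), while dropping A or B lowers it; for a real scale the decision rule described in the text is only that alpha should not fall appreciably once the weak item is removed.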
Furthermore, we share other authors' doubts about the relevance of including the FS scale in the measurement of pure empathy (Baron-Cohen & Wheelwright, 2004; Lawrence et al., 2004).

Construct validity
The internal consistency coefficients of the four Dutch IRI scales ranged from acceptable to high. The IRI scale scores were only weakly intercorrelated, which is logical and theoretically meaningful: the intercorrelations suggest that PT, FS, EC, and PD are four statistically related but (relatively) discriminable constructs. Moreover, the gender differences found for each of the four IRI scales are consistent with traditional gender stereotypes that women are more emotional and more caring than men (Zahn-Waxler et al., 1991) and thus perceive themselves as more empathic than men. This pattern is also consistent with the sex differences typically found with the original IRI (Davis, 1980). Together, these results provide additional support for the construct validity of the IRI scale scores.

Convergent and discriminant validity
Evidence for convergent and discriminant validity of the Dutch IRI scale scores came from their pattern of relationships with the scores on measures of EQ, the Big Five, Machiavellianism, self-esteem, and three intellectual ability indices. Overall, the data indicated high levels of convergent and discriminant validity for each of the four IRI scale scores.
As expected, no IRI scale score except the PD scale score was associated with any of the intellectual ability indices. This finding provides additional evidence for the discriminant validity of the scores on the other three IRI scales. The negative relation between the PD score and Total and Verbal IQ, by contrast, was a surprising finding. One tentative explanation is that, during episodes of intense distress, assessments of stable traits such as Total and Verbal IQ (Furnham, Forde, & Cotter, 1998) may be somewhat contaminated by current distress levels, and such distress might also substantially lower self-esteem (e.g., Ormel & Schaufeli, 1991). In other words, people who tend to experience high levels of personal distress might report lower levels of Total and Verbal IQ because of lower self-confidence.
Moreover, the scores on each IRI scale showed clearly different patterns of relations with the scores on other psychological measures. The PT scale score was more strongly related to overall EQ than was any other IRI scale score. This relationship was primarily due to the Interpersonal component, but PT scores were also associated with better Stress Management and Adaptability. PT scores were also related to scores on the Openness and Agreeableness dimensions of the NEO-FFI. These findings suggest that persons with high PT scores are able to regulate their emotions and thus function smoothly in social environments. EC scores were likewise associated with better Interpersonal EQ, but not with better Stress Management or Adaptability. EC scores were also related to high scores on the Agreeableness trait and low scores on the Mach-IV scale. These results suggest that persons with high EC scores tend to be good-natured, warm-hearted, and non-manipulative, all qualities that can enhance social success.
The PD scale score was negatively related to Total EQ, but through a very different path from that of the PT scale score. PD scores were associated with a lower ability to manage one's own stress and mood and with weaker Intrapersonal EQ skills. Completing this pattern, higher PD scores were associated with lower self-esteem and higher Neuroticism. Those high in personal distress thus seem to be at the mercy of their emotions and unable to regulate them effectively, which may contribute to their more negative self-evaluation. The FS scale score was the only one that was largely unrelated to the other psychological variables. These findings strongly suggest that the FS scale is the least 'social' of the four IRI scales.

Conclusion
In sum, the findings presented in this study provide evidence for the reliability and validity of the Dutch version of the IRI and indicate that this scale is useful for measuring perceived empathic tendencies in a Dutch sample. The findings should, however, be interpreted in the light of certain limitations. Replicating this investigation with other samples from the Dutch population could strengthen conclusions regarding the validity and reliability of the scores on the Dutch version of the IRI. This is especially important because the present investigation is the first empirical analysis of the measurement invariance of the IRI, so our results await replication in other samples. Furthermore, our data are based on self-report measures only; we did not include behavioural (non-self-reported) indicators of empathic responding. Consequently, associations between the variables under study might be spuriously inflated or otherwise distorted by shared method variance (Lorenz, Conger, Simons, Whitbeck, & Elder, 1991). Additionally, we did not explicitly test the IRI against other existing empathy measures. Incorporating additional self-report empathy measures and alternative assessment methodologies into future research may therefore provide the additional evidence needed for conclusions about convergent and discriminant validity.