From the ¹Faculty of Applied Health and Social Sciences, Technical University of Applied Sciences Rosenheim, Rosenheim, Germany, ²Swiss Paraplegic Research, Nottwil, ³Department of Health Sciences and Medicine, University of Lucerne, Lucerne, Switzerland, ⁴Department of Physical Medicine and Rehabilitation, Faculty of Medicine, Ankara University, ⁵Department of Biostatistics, Faculty of Medicine, Ankara University, Ankara, Turkey, ⁶Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark and ⁷Leeds Institute of Rheumatic and Musculoskeletal Medicine, University of Leeds, Leeds, UK
Objectives: To develop a common reference metric of functioning, incorporating generic and health condition-specific disability instruments, and to test whether this reference metric is invariant across 2 health conditions.
Design: Psychometric study using secondary data analysis. Firstly, the International Classification of Functioning, Disability and Health (ICF) Linking Rules were used to examine the concept equivalence between the World Health Organization Disability Assessment Schedule (WHODAS 2.0), Health Assessment Questionnaire (HAQ) and Functional Independence Measure (FIM™). Secondly, a scale-bank was developed using a reference metric approach to test-equating, based on the Rasch measurement model.
Participants: Secondary analysis was performed on data from 487 people; 61.4% with rheumatoid arthritis and 38.6% with stroke.
Results: Three sub-domains of the WHODAS 2.0 and all items of the HAQ and FIM™ motor scale mapped on to the ICF chapters d4 Mobility, d5 Self-care and d6 Domestic life. Test-equating of these scales resulted in good model fit, indicating that a scale-bank and associated reference metric across these 3 instruments could be created.
Conclusion: This study provides a transformation table to enable direct comparisons among instruments measuring physical functioning commonly used in rheumatoid arthritis (HAQ) and stroke (FIM™ motor scale), as well as in people with disability in general (WHODAS 2.0).
Key words: psychometrics; outcome assessment; stroke; rheumatoid arthritis; World Health Organization Disability Assessment Schedule; WHODAS 2.0; Functional Independence Measure; FIM™; Health Assessment Questionnaire; HAQ.
Accepted Aug 18, 2020; Epub ahead of print Sep 8, 2020
J Rehabil Med 2020; 52: jrm00107
Correspondence address: Birgit Prodinger, Technical University of Applied Sciences Rosenheim, Hochschulstr. 1, 83024 Rosenheim, Germany. E-mail: email@example.com
Functioning is what matters most to people with chronic health conditions, such as stroke or rheumatoid arthritis. While medical signs and symptoms related to these health conditions may vary widely, research has shown that people may experience similar problems with functioning. Therefore, being able to monitor and compare functioning over time is essential for the planning and allocation of rehabilitation. This study provides evidence that a common measure can be created, based on a single general disability instrument and 2 health condition-specific instruments. For clinical practice this implies that standardized reporting of functioning can be achieved based on a common measure, while data collection can continue using the commonly used and established instruments.
Functioning, as the third health indicator in health systems, complements information about mortality and morbidity by providing information about how a health condition plays out in everyday life (1). Functioning includes information about what a person does in everyday life, including moving around, getting dressed, doing housework or participating in paid work, as well as the interaction of these activities with the health condition, impairments in body structures and functions, and with contextual factors. A detriment in any domain of functioning refers to disability (2). The number of people living with disability worldwide is increasing steadily (3). For various chronic health conditions, including stroke and rheumatoid arthritis (RA), although indicators of mortality and morbidity are declining, the number of people who experience long-term disability after having been diagnosed with such a health condition is increasing (4, 5).
Rehabilitation is a strategy aimed at optimizing functioning (6). As such, it is essential to monitor functioning and disability, as well as to set targeted interventions at the individual and population level. Nevertheless, a lack of data on functioning has been continuously reported (7). Functioning information is predominantly collected with a focus on single health conditions. While this may be justified and necessary for certain purposes, it has been shown that people with various disorders, including stroke, multiple sclerosis, and RA experience similar functioning problems in their everyday life despite different underlying health conditions (8). Consequently, in order to compare functioning across diverse conditions, information is needed that is invariant across those conditions. Invariance implies that, at the same level of functioning, an instrument measuring functioning has the same meaning and yields a comparable score across relevant groups.
At least 2 approaches can be utilized for documenting and reporting functioning information that is invariant across health conditions. First, generic disability instruments can be used that have been shown to be both reliable and valid across the relevant health conditions. Secondly, transformation tables can be established that enable the comparison of disability scores obtained with different instruments across health conditions. Regarding the first approach, the World Health Organization Disability Assessment Schedule (WHODAS 2.0) is a generic disability instrument that has been translated into various languages and whose psychometric properties have been tested in various health conditions (9). However, no study has been conducted to date that has examined whether the WHODAS 2.0 is invariant across different health condition groups, such as musculoskeletal and neurological disorders. Regarding the second approach, previous research has established the principles of how to develop a transformation table allowing the reporting of scores of different instruments on a reference metric (10). To our knowledge, to date, no study has examined whether a reference metric can be established across multiple scales from different health condition groups.
The objective of this study was to create a reference metric underlying instruments commonly used along the continuum of care to measure functioning domains in people with various chronic health conditions. More specifically, the aims were:
A “reference metric” is defined here as one upon which 3 or more instruments are calibrated, whereas a “common metric” is a co-calibration of 2 instruments.
A psychometric study was conducted using secondary analysis of data collected previously. A common item, non-equivalent person design was deployed in an innovative manner by using the total scores from scales as partial credit items in order to equate tests (11). Test-equating applications have a long tradition in education and psychology (12), whereas their application in health was rare until recently, where, for example, one study linked 6 sleep disorder scales based on an ordinal reference metric using Leunbach's model (13). The current study uses Andrich's RUMM2030 (14) to equate 3 instruments widely applied in health outcome studies, to create an interval scale reference metric upon which each of the 3 scales is calibrated, and to test that the reference metric is invariant across age, sex and different health conditions.
The RA set included data for 299 outpatients with RA who responded to questions in the WHODAS 2.0 and the Health Assessment Questionnaire (HAQ) for a previous methodological outcome measurement study (15). The stroke set included data for 188 community-dwelling patients living with stroke who completed the WHODAS 2.0 and the Functional Independence Measure (FIM™) for a previous validation study (16). For all 3 instruments the validated Turkish versions were administered (17, 18). Both studies were performed at the Department of Physical Medicine and Rehabilitation, Ankara University Medical Faculty, and ethical approval was given by the Research Ethics Committee of Medical Faculty, Ankara University, study number 127-3559 (for the study related to the RA set) and 136-3990 (for the study related to the stroke set).
The World Health Organization Disability Assessment Schedule (WHODAS 2.0) is a generic disability instrument. The complete version consists of 36 items on functioning and disability within 6 domains: understanding and communicating (6 items), getting around (5 items), self-care (4 items), getting along with others (5 items), life activities (8 items), and participation in society (8 items). Four of the latter relate to school or work situations, and can be omitted if not relevant. Items are scored on a 5-point scale, ranging from 1 = none to 5 = extreme/cannot do. Six domain scores and a total score are available for the evaluation of dimensions of disability and health status; higher scores reflect greater disability (19). WHODAS 2.0 has been tested and used in more than 16 countries, mainly among adults 18 years of age or above. Both classical and modern psychometric analyses have been used to support the validity of the instrument in RA and stroke populations (9). We included only those domains of the WHODAS 2.0 in our analyses that revealed conceptual equivalence with the other instruments included in this study. It is noteworthy that the few previous Rasch analyses conducted of the WHODAS 2.0 in specific health conditions focused on generating a score on the full scale rather than a score at the domain level (20, 21).
With respect to health condition-specific measures, data on the HAQ was collected in patients with RA, and data on the FIM™ in patients with stroke. The HAQ was developed to be used across various rheumatic conditions (22) and has been described as a valid, reliable and responsive measure in the RA population (23). The HAQ consists of 20 items divided into 8 domains: Dressing & Grooming, Arising, Eating, Walking, Hygiene, Reach, Grip, and Activities. All items are rated on a 4-point scale (0 = without any difficulty, 3 = unable to do). The highest score reported by the patient for any question within each domain determines the score for that domain. Subsequently, the mean score of the 8 domains is calculated as the HAQ score in a range of 0–3. In this study, the HAQ was scored without the score adjustment for assistive devices and help. Since the other included PROMs reflect a performance perspective, whereas adjusting HAQ scores attempts a capacity perspective, i.e. trying to ascertain what level of problem the individual would have had without using assistive devices or help, we refrained from the score adjustment.
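The conventional HAQ scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the number of items assigned to each domain in the example is an assumption made for demonstration rather than something stated in this paper. The second function shows the plain item sum, which keeps the information from all 20 items.

```python
from statistics import mean


def haq_domain_score(ratings_by_domain):
    """Conventional HAQ score: highest item rating per domain,
    then the mean of the domain scores (range 0-3)."""
    domain_scores = [max(items) for items in ratings_by_domain.values()]
    return mean(domain_scores)


def haq_sum_score(ratings_by_domain):
    """Plain sum of all item ratings (range 0-60 for 20 items rated 0-3)."""
    return sum(sum(items) for items in ratings_by_domain.values())


# Hypothetical ratings for one patient; the split of the 20 items
# across the 8 domains is an assumption for illustration.
example_ratings = {
    "Dressing & Grooming": [1, 0],
    "Arising": [0, 0],
    "Eating": [2, 1, 0],
    "Walking": [1, 1],
    "Hygiene": [0, 0, 0],
    "Reach": [3, 1],
    "Grip": [0, 1, 0],
    "Activities": [2, 2, 1],
}

if __name__ == "__main__":
    print(haq_domain_score(example_ratings))
    print(haq_sum_score(example_ratings))
```

The two rules can rank patients differently, because the domain rule discards every rating below the domain maximum.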
The Functional Independence Measure (FIM™) motor scale is a widely used generic assessment tool, which can be used as an outcome measure for the functional status and burden of care in rehabilitation patients (24). The FIM™ also includes a cognitive scale, which was not used in this study. The FIM™ motor scale consists of 13 items, which can be grouped into 3 sub-scales: self-care; sphincter control; and transfer and mobility (25). A 7-level scoring system is used to rate independence in each item, where 1 = complete dependence and 7 = complete independence. Thus, the total score ranges from 13 to 91, where higher scores indicate higher functional independence. Studies of the psychometric quality of the FIM™ have shown that it has a high overall internal consistency, adequate discriminative capabilities for rehabilitation patients and some responsiveness, construct validity, and good inter-rater reliability (26, 27). Furthermore, previous Rasch analyses of the FIM™ have shown that there are local dependencies amongst items, which can be absorbed by replacing the dependent items with testlet scores (28).
WHODAS 2.0 was collected in both clinical populations, whereas FIM™ was collected only in people with stroke, and the HAQ only in people with RA.
To establish comparability of existing scales, 2 aspects are important (10). First, to examine the conceptual equivalence of the existing instruments, they were linked to a universal reference framework. The International Classification of Functioning, Disability and Health (ICF) was used in this study, which is the recommended standard set out by the World Health Organization to describe health and disability of individuals and populations, providing an internationally agreed language and structure (2). The ICF Linking Rules, an established method to link existing instruments to the ICF, were applied (25). The current study accessed existing linkings from the ICF Research Branch (www.icf-research-branch.org) in which the first author was involved and which were performed accordingly. The results of the ICF Linking provide evidence for the conceptual equivalence of the identified instruments, which is fundamental for scale equating (10). For this study, we considered the items or sub-sets of items contained in the identified instruments as conceptually equivalent if they were linked to the same ICF chapter.
Secondly, to achieve score equivalence between the scores of the identified instruments, test-equating was undertaken within the Rasch measurement model framework using RUMM2030 (14). Data came from one study population with responses to FIM™ and WHODAS 2.0 and another population with responses to HAQ and WHODAS 2.0. For this reason, the data generated 2 sets of ordinal level raw scores that cannot be compared. However, under the Rasch model, the 2 sets of raw scores can be transformed into interval scaled estimates of person parameters that define the basis of a reference metric where scores from the different populations become comparable.
To establish score equivalence, 2 aspects of the study design are important. First, the WHODAS 2.0 was collected in both the RA and stroke population, and thus served as the common scale to link between the 2 data-sets. The scoring of the FIM™ motor scale was reversed so that, in all scales, a low score indicated no problems/high independence and a high score indicated extreme problems/high dependency. Secondly, during test equating in RUMM2030, the total scores of the scales are equated such that the scales become the items (11). The location of a scale is thus the mean of the threshold locations, just as in ordinary partial credit items, except there will usually be more thresholds.
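The idea of treating a scale's total score as a single partial credit item can be illustrated with a short Python sketch. It is not the RUMM2030 implementation: the function names and the threshold values used in the example are hypothetical. The sketch shows the standard partial credit category probabilities for a given person location, the item (scale) location as the mean of the thresholds, and a generic score reversal of the kind applied to the FIM™ motor scale.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities of a partial credit item.

    theta: person location in logits.
    thresholds: threshold locations tau_1..tau_m in logits.
    Category x has log-weight sum_{k<=x} (theta - tau_k); category 0 has 0.
    """
    log_weights = [0.0]
    for tau in thresholds:
        log_weights.append(log_weights[-1] + (theta - tau))
    largest = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(value - largest) for value in log_weights]
    total = sum(weights)
    return [weight / total for weight in weights]


def item_location(thresholds):
    """Location of an item (here: a whole scale) as the mean threshold."""
    return sum(thresholds) / len(thresholds)


def reverse_score(score, minimum, maximum):
    """Reverse a score so that the lowest value maps to the highest."""
    return minimum + maximum - score


if __name__ == "__main__":
    hypothetical_thresholds = [-1.0, 1.0]  # illustrative values only
    print(pcm_probabilities(0.0, hypothetical_thresholds))
    print(item_location(hypothetical_thresholds))
    print(reverse_score(13, 13, 91))
```

With a scale total used as the item, the list of thresholds simply becomes much longer, one threshold per attainable step of the total score.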
During analysis, the scales (items) are subjected to the usual Rasch analysis procedures, to test whether the data deviate from the Rasch model's assumptions of item-fit, invariance and unidimensionality. χ2 tests and residuals are used to assess the fit of test scores (items) to the Rasch model. Due to the structural missing data design, only the WHODAS 2.0 was administered to all persons. For this reason, pairwise calibration of the WHODAS 2.0 with the HAQ and with the FIM™ motor scale was conducted before all 3 scales were equated. The pairwise analyses included a Conditional Test of Fit (CTF) of the test scores to the Rasch model to ascertain that the 2 test scores (items) measured the same latent trait. The Benjamini-Hochberg procedure was applied to adjust for multiple testing (29).
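For readers unfamiliar with the Benjamini-Hochberg adjustment, the following Python sketch computes adjusted p-values in the usual step-up form. The function name and the example p-values are hypothetical and do not reproduce the values analysed in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is multiplied by m / rank (rank 1 = smallest), and
    monotonicity is enforced by taking a running minimum from the
    largest p-value downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda index: p_values[index])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    hypothetical_p = [0.01, 0.04, 0.03, 0.20]  # illustrative values only
    print(benjamini_hochberg(hypothetical_p))
```

A hypothesis is then rejected when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.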
Invariance requires that 2 persons with the same trait level, yet with different personal or health condition characteristics, such as sex or diagnosis, have the same probability of achieving a given score on the item. Under the joint model for the 3 instruments, invariance implies that there is no differential item functioning (DIF) (30) relative to age, sex, and health condition, tested in this case by an analysis of variance (ANOVA) of the residuals.
Local response independence is an important assumption of the Rasch model (31). Items may be locally dependent because of response dependence or because of multidimensionality. For this reason, the analysis by the joint model calculated residual correlations between items (instruments in this case) and tested unidimensionality by paired t-tests comparing person estimates based on the WHODAS 2.0 + HAQ with person estimates based on FIM™, and by paired t-tests comparing person estimates by WHODAS 2.0 + FIM™ with person estimates by HAQ. During these analyses, RUMM2030 counts the number of cases where the p-values of the paired t-tests are less than or equal to 5%, and compares this number with the expected 5% of the persons.
Once evidence for score equivalence is established, the metric needs to be defined. In principle, the person parameters of the joint Rasch model could be estimated by outcomes on the separate scores, by outcomes on WHODAS 2.0 + HAQ or WHODAS 2.0 + FIM™; or by WHODAS 2.0 + HAQ + FIM™ if data on all scores had been collected for some persons. All of these estimates would position the persons on the same logit scale, with values from minus to plus infinity. However, since many users prefer measures without negative values, it is common to change the origin and the unit of the logit scale so that the range of possible outcomes lies within an interval from zero to a reasonable upper limit. For this reason, we propose a reference metric defined by the possible outcomes of all 3 scales. To change these logits into values with which users will be more comfortable, the origin and the unit of the logit scale were changed in such a way that the values on the WHODAS 2.0 + HAQ + FIM™ raw score transformed to an interval-scaled reference metric from 0 to 100.
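Changing the origin and unit of the logit scale is a linear transformation, which can be sketched as follows. This is a minimal illustration, assuming that finite logit values have been assigned to the lowest and highest total raw score; the function name and the logit values in the example are hypothetical and are not the estimates obtained in this study.

```python
def rescale_to_reference_metric(logit, logit_min, logit_max,
                                lower=0.0, upper=100.0):
    """Linearly map a logit value onto the interval [lower, upper].

    logit_min and logit_max are the logit values assigned to the lowest
    and highest possible total raw score (assumed inputs).
    """
    if logit_max <= logit_min:
        raise ValueError("logit_max must be greater than logit_min")
    unit = (upper - lower) / (logit_max - logit_min)
    return lower + (logit - logit_min) * unit


if __name__ == "__main__":
    # Hypothetical range of the joint logit scale, for illustration only.
    print(rescale_to_reference_metric(-0.417, -6.0, 6.0))
```

Because the mapping is linear, equal differences in logits remain equal differences on the 0 to 100 metric, so the interval property of the scale is preserved.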
In total, the sample consisted of 487 people; 299 (61.4%) with RA and 188 (38.6%) with stroke. In the RA sample 25.4% were male, and in the stroke sample 53.7% were male.
The 3 instruments were linked to the ICF. As shown in Table I, all items of the HAQ and FIM™ motor scale were linked to the ICF chapters d4 Mobility, d5 Self-care and d6 Domestic life, as were the item blocks related to Getting around, Self-care and Life activities of the WHODAS 2.0. The item "D3.4 Staying by yourself for a few days" of the WHODAS 2.0 and the item "Do chores such as vacuuming or yard work" of the HAQ were linked to d5 Self-care and d6 Domestic life respectively, rather than to a specific ICF category, since the content of these items was not further specified.
Table I. International Classification of Functioning, Disability and Health (ICF) linking table
Tables II–V show the results of the analyses of fit of WHODAS 2.0, HAQ and FIM™ motor scale to the joint Rasch models for all 3 scales. There are a few significant fit statistics, but significance is generally weak, with p-values between 0.01 and 0.05. After adjusting for multiple testing all hypotheses were accepted, except for the evidence of DIF relative to age during pairwise calibration of WHODAS 2.0 to the other 2 scales, where the adjusted p-values are 0.01. Since the analysis of DIF in the joint model did not provide evidence of DIF relative to age, we accepted the joint Rasch model for the 3 scales and concluded therefore that a common reference metric for WHODAS 2.0, HAQ, and FIM™ is feasible.
Table II. Item (scale)-fit statistics
Table III. Test of item trait interaction and reliability measure by the Person Separation Index (PSI)
Table IV. Analyses of DIF and local dependence
Table V. Tests of unidimensionality
Given the evidence that scale equating is possible, the logit scale was transformed into a scale from 0 to 100. Fig. 1 shows a so-called item map presenting the distribution of the person estimates on the logit scale together with the locations of the items. The targeting of the equated scales was good, with a person mean of –0.417, where the item mean is 0 (Fig. 1). The slight offset to the milder end of functional limitation is driven largely by the RA sample. Fig. 2 shows the ranges of reference logits that could have resulted from a total WHODAS 2.0 + HAQ + FIM™ score, and the logit values that the separate scales could have delivered.
Fig. 1. Targeting of samples across the reference metric. SD: standard deviation; No.: number.
Fig. 2. Operational range of each scale in relation to the reference metric. WHODAS: World Health Organization Disability Assessment Schedule; HAQ: Health Assessment Questionnaire; FIM™: Functional Independence Measure.
Table VI shows how to transform raw scores from the scales into the common reference metric. A raw score equal to, respectively, 0, 77 and 154 on WHODAS 2.0 + HAQ + FIM™ transforms to 0, 42.9 and 100 on the reference metric, while raw scores on WHODAS equal to 0, 26 and 52 correspond to reference values equal to 13.0, 38.3 and 74.0. Note that in Table VI the FIM™ motor scores were reversed back to the original scoring direction, so that a low score indicates high dependency and a high score, low dependency.
Table VI. Transformation table. A low score on the reference metric indicates no difficulties and a high score extreme difficulties
The transformation table (Table VI) allows clinicians and researchers to exchange information collected with the 3 instruments in stroke and RA populations. It is possible to determine what the raw score on one scale would equate to on another scale, and to compare differences between raw scores in a meaningful way. Consider, for instance, the following 3 patients: patient 1 has a raw score of 37 on the summed domains of the WHODAS 2.0 that transforms to a reference metric score of 43.9; patient 2 has a raw score of 14 on HAQ corresponding to a reference metric score of 39.5; and patient 3 has 52 on the FIM™ motor scale and therefore a reference score of 49.1. In other words, patient 2 has fewer and patient 3 more problems with functioning than patient 1. Assume, next, that the patients' conditions have improved after rehabilitation and that we want to compare the degrees of improvement. Patient 1 now has a raw score of 10 on WHODAS 2.0, patient 2 has a raw score of 4 on HAQ and patient 3 has 75 on FIM™. Since the scores transform to, respectively, 31.4, 31.8 and 36.5 on the reference metric, we see that patients 1 and 2 are at nearly the same level of difficulties after rehabilitation and patient 3 continues to have more difficulties. The differences in the reference scores are meaningful because the reference scale is an interval scale. These differences show that the improvement in patient 3 (12.6 on the reference metric) is roughly 1.6 times the improvement in patient 2 (7.7) and marginally larger than the improvement in patient 1 (12.5).
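The arithmetic of this worked example can be recomputed directly from the reference metric scores quoted in the text. The short Python sketch below uses only those quoted values; the variable names are illustrative.

```python
# Reference metric scores quoted in the text, before and after rehabilitation.
before = {"patient 1": 43.9, "patient 2": 39.5, "patient 3": 49.1}
after = {"patient 1": 31.4, "patient 2": 31.8, "patient 3": 36.5}

# Improvement = reduction in difficulty on the interval-scaled metric,
# rounded to one decimal to match the precision of the quoted scores.
improvement = {
    name: round(before[name] - after[name], 1) for name in before
}

# Ratios of differences are interpretable because the metric is interval-scaled.
ratio_3_to_2 = improvement["patient 3"] / improvement["patient 2"]

if __name__ == "__main__":
    for name, value in improvement.items():
        print(name, value)
    print(round(ratio_3_to_2, 2))
```

The improvement of patient 3 relative to patient 2 comes out at roughly 1.6, a comparison that would not be meaningful on the ordinal raw scores of the separate instruments.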
This study provides evidence that it is feasible to create an interval-scaled reference metric across health conditions, using existing generic and health condition-specific disability instruments, and that the reference metric was invariant across RA and stroke. The basic methods applied in this study are not new, although the integration of the ICF Linking Rules with the Rasch measurement model to establish conceptual and score equivalence between instruments has only recently been introduced (32). The ICF Linking Rules provide reference to the international standard for reporting functioning set by the World Health Organization (WHO) and endorsed by various institutes, such as the ISO Standard for Quality Management in Health Care Services (International Organization for Standardization (ISO) 9001:2015) (33).
From one point of view, the results are similar to those from psychometric test equating of raw scores and, in particular, similar to the analysis of indirect equating (13). In this previously published study of indirect equating, the Leunbach model was utilized, where the model is the joint distribution of 2 power series distributions depending on the same person parameter. This distribution can be rewritten as the distribution of a partial credit item, given the relationship between power series distributions and polytomous Rasch items (34–37). Thus, Leunbach's model is nothing but a partial credit model with 2 items and equating at the ordinal level. However, what is innovative in the present study is that the equating was based upon an interval-scaled reference metric, derived by a joint equating model for all 3 scales, enabling a transformation from the separate scores to a joint logit scale. The logit scale is an interval scale. For this reason, the logit scale, or a convenient linear transformation of the logits, is the natural reference metric on which to compare test results from the different scales.
Increasingly we are seeing studies that calibrate instruments together, usually within a single diagnosis, or sometimes with a wider focus, across, for example, a musculoskeletal group encompassing several diagnoses, or focussing on a particular symptom, such as pain (38, 39). These recent equating applications have used sample-dependent item response theory (IRT) models such as the generalized partial credit model, where person estimates must be averaged to provide a transformation table, given that each raw score can provide, in theory, vast numbers of different estimates. The critical issue is that any transformation table presented should reflect a calibration model that delivers estimates that are independent of the distribution upon which the calibration is based. Only then, given the same frame of reference (e.g. diagnostic group(s)), can clinicians and others have confidence that those transformations apply to their own sample, involving the same frame of reference. This requires parameter separation between persons and items, which is consistent with applying the Rasch model as in the current study.
Andrich's approach, of rewriting one of the models for test equating as a partial credit model and using RUMM2030 facilities for Rasch analysis during test equating, is both recent and important for deriving the reference metric (11). Tests of invariance of WHODAS 2.0 would be a challenge for the majority of DIF tests implemented in programmes for IRT and Rasch analysis, but were not a problem for the ANOVA analysis of DIF implemented in RUMM2030. Thus, the application of these methods has significant implications for outcome research in the future. The reference metric based on these 3 instruments allows collating of data derived from any of these instruments for different purposes, such as clinical decision-making, benchmarking, or meta-analyses. This approach implies that standardization of outcome measurement does not require standardization of the instruments, but rather enables the standardized reporting of functioning outcomes irrespective of the instrument used.
The limitations of this study are consistent with the use of existing data for secondary analysis, where no influence is possible on the initial data collection. The low percentage of males (25.4%) in the RA sample is consistent with the prevalence of RA in men. Furthermore, data-sets that included both the WHODAS 2.0 and a health condition-specific instrument were available for only one health condition per group: RA for musculoskeletal and stroke for neurological conditions. The contextual factors available for DIF analysis are also constrained to only those data shared across the original studies. Note that in the present analysis the sum-score of each scale was used for analysis. This scoring is in accordance with the traditional scoring of the WHODAS 2.0 and FIM™. The HAQ scoring is usually different, as the highest score within a domain determines the score for the domain. The sum of the domain scores is then divided by the number of domains, and thus results in a score from 0 to 3. Previous research has shown fit of the 20 HAQ items to the Rasch model and emphasized the value of using the full information of all 20 items rather than the highest score of each domain (40). From the perspective of this study, it is a limitation that the HAQ scores are not directly comparable with the HAQ scorings often used in practice. Nevertheless, it has the advantage that the information from all 20 items was maintained, and if one has access to the ratings of each HAQ item, one can use the transformation table provided in this paper. Another limitation is the apparent absence of any evidence of the quality of the equating procedure. It is true that if the data fit the model (e.g. fit statistics and graphical information), then the accuracy of every other inference follows, to that level of fit. Nevertheless, most recently a "standard error of equating" has been proposed, which shows the accuracy of equating at all score levels (13). Currently confined to 1 software package, it is hoped that this will be taken up elsewhere; for example, in the expanding R-based Rasch procedures. Finally, the pairwise co-calibration of the WHODAS 2.0 and the HAQ resulted in a slightly less than acceptable level of non-error variance retained, but the triple calibration of all 3 scales was much more robust, and the resulting transformation tables are based on this latter analysis.
This study provides evidence and a transformation table to enable direct comparisons among instruments commonly used to measure functioning in RA (HAQ) and stroke (FIM™ motor scale), as well as in people with disability in general (WHODAS 2.0). Clinicians, public health experts and researchers are thus supported in the continuing use of their existing instruments, and their historical data collections, whilst being able, where necessary, to compare results across health conditions.