Abstract
Mortality rates in acute renal failure remain extremely high, and risk-adjustment tools are needed for quality improvement initiatives and design (stratification) and analysis of clinical trials. A total of 605 patients with acute renal failure in the intensive care unit during 1989-1995 were evaluated, and demographic, historical, laboratory, and physiologic variables were linked with in-hospital death rates using multivariable logistic regression. Three hundred and fourteen (51.9%) patients died in-hospital. The following variables were significantly associated with in-hospital death: age (odds ratio [OR], 1.02 per yr), male gender (OR, 2.36), respiratory (OR, 2.62), liver (OR, 3.06), and hematologic failure (OR, 3.40), creatinine (OR, 0.71 per mg/dl), blood urea nitrogen (OR, 1.02 per mg/dl), log urine output (OR, 0.64 per log ml/d), and heart rate (OR, 1.01 per beat/min). The area under the receiver operating characteristic curve was 0.83, indicating good model discrimination. The model was superior in all performance metrics to six generic and four acute renal failure-specific predictive models. A disease-specific severity of illness equation was developed using routinely available and specific clinical variables. Cross-validation of the model and additional bedside experience will be needed before it can be effectively applied across centers, particularly in the context of clinical trials.
Acute renal failure (ARF) in critically ill patients is associated with a distressingly high mortality rate (1–3). Despite improvements in intensive care and dialytic technology, particularly with continuous renal replacement therapies, we have not observed meaningful improvements in patient survival over the past three decades (4–8). In most series, more than 50% of patients with hospital-acquired ARF die before hospital discharge; of those who survive, between 10 and 33% require long-term dialysis (9–11).
Over the past decade, several clinical trials have been conducted, aiming to reduce ARF-associated mortality (12–14). Most of these studies have unfortunately proved unsuccessful, including relatively large, well-designed trials using pharmacologic agents with strong preclinical data (e.g., atrial natriuretic peptide [ANP]). Among the difficulties in design and analysis of clinical trials in ARF are the lack of a standardized definition of ARF, the heterogeneity of ARF, comorbidity and severity of illness directly influencing mortality, and large variations in the process of care.
Generic severity-of-illness scoring systems (e.g., Acute Physiology and Chronic Health Evaluation-II [APACHE II], Simplified Acute Physiology Score [SAPS], and Logistic Organ Dysfunction Score [LODS]) have not discriminated well in most published studies of ARF (15–18). Although several authors have proposed disease-specific indices, most have been derived at single centers (19–22), and few have been validated outside of their original institution (23–24). Moreover, the timing of evaluation (e.g., consultation, initial dialysis procedure) and the population to which the index was applied (e.g., all patients with ARF, only dialyzed patients, etc.) have differed across studies. To prepare for future clinical trials in ARF, it is essential that valid, generalizable models for risk adjustment be developed, both for stratification in patient selection and for covariate adjustment in the event of imbalanced randomization.
We evaluated 851 consecutive cases of ARF in the intensive care unit (ICU) at four university-affiliated hospitals during 1989 through 1995 (1). Among these 851, 166 patients were entered into a randomized clinical trial comparing intermittent hemodialysis (IHD) with continuous renal replacement therapy (CRRT), the results of which are reported elsewhere (25). Incorporating all patients who were followed over the study period, we accumulated a vast array of demographic, clinical, and laboratory data from which a severity of illness index could be developed. We also prospectively collected a series of previously published generic and disease-specific severity of illness scores (vide infra) to compare and contrast them with our own model. Ultimately, our primary goal was to develop a model that would be valid and generalizable across populations with ARF in the ICU, which could later be cross-validated and refined by using data from other clinical sites.
Materials and Methods
Study Cohort
Patients were those who received a nephrology consultation for ARF while in the ICU or who were later transferred to the ICU with ARF at four hospitals in Southern California between October 1989 and September 1995. Acute renal failure was defined by using standard laboratory parameters. For patients with no prior history of kidney disease or known laboratory values, acute renal failure was defined either by a blood urea nitrogen (BUN) ≥40 mg/dl or a serum creatinine ≥2.0 mg/dl. For patients with preexisting chronic renal insufficiency (CRI), acute renal failure was defined by a sustained rise in serum creatinine of ≥1 mg/dl compared with baseline. Exclusion criteria included previous dialysis, kidney transplantation, urinary tract obstruction, and hypovolemia responsive to fluids. Informed consent was obtained from all study participants or their next-of-kin. A total of 851 ARF cases were initially evaluated. No information on vital status was available in 31 (3.6%). Of the 820 remaining, data sufficient to calculate the generic (APACHE II [26], APACHE III [27], SAPS II [28], LODS [29], Multiple Organ Dysfunction [MOD] [30], Brussels [31], Sequential Organ Failure Assessment [SOFA] [32]) and disease-specific (see Appendix; Liaño et al. [10], Schaefer et al. [19], ANP Study [24], Lohr et al. [20], and Stuivenberg Hospital Acute Renal Failure [SHARF] [22]) severity of illness scores were available in 605 patients (73.7%), which comprised the analytic sample. Of the 605 patients, 262 (43.3%) were at University of California San Diego Medical Center, 167 (27.6%) were at the San Diego Veterans Affairs Medical Center, 104 (17.2%) were at the Navy Hospital, and 72 (11.9%) at University of California Irvine Medical Center. Baseline vital signs, hemodynamic, and laboratory data were recorded for the first ICU day and each day from the time of nephrology consultation.
For this study, variables collected on the day of nephrology consultation were used to compute the severity of illness scores.
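The laboratory definition of ARF given above can be written as a small predicate. The following is a minimal sketch only: the function and parameter names are hypothetical, the exclusion criteria are not modeled, and because the text does not say how a "sustained" rise was operationalized, a single creatinine value is compared with baseline.

```python
from typing import Optional


def meets_arf_definition(bun_mg_dl: float,
                         creatinine_mg_dl: float,
                         baseline_creatinine_mg_dl: Optional[float] = None) -> bool:
    """Return True if laboratory values satisfy the study's ARF definition.

    baseline_creatinine_mg_dl is None for patients with no prior history of
    kidney disease or no known earlier laboratory values.
    """
    if baseline_creatinine_mg_dl is None:
        # No known baseline: BUN >= 40 mg/dl or serum creatinine >= 2.0 mg/dl.
        return bun_mg_dl >= 40.0 or creatinine_mg_dl >= 2.0
    # Preexisting renal insufficiency: rise of >= 1 mg/dl over baseline.
    # Assumption: one measurement stands in for a "sustained" rise.
    return creatinine_mg_dl - baseline_creatinine_mg_dl >= 1.0
```

In practice the exclusion criteria (previous dialysis, kidney transplantation, urinary tract obstruction, and fluid-responsive hypovolemia) would be applied before this check.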
Statistical Analyses
We used conventional parametric statistics for the primary data analyses. Continuous variables were described as mean ± SD, and categorical variables were described as proportions. Logistic regression was the primary analytic method employed. Odds ratios (OR) and 95% confidence intervals (95% CI) were derived from model parameter coefficients and standard errors, respectively. Multivariable analyses using backward variable selection (model acceptance criterion, P < 0.05) were conducted to simultaneously adjust for confounding variables. Proportional hazards (Cox) regression was also applied, censoring patients 60 d after their initial consultation, to determine the hazard ratio (equivalent to a relative risk [RR]), associated with the same candidate variables. For the Cox models, plots of log (−log [survival rate]) against log (survival time) were performed to establish the validity of the proportionality assumption (33).
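The conversion from a logistic regression coefficient and its standard error to an OR and 95% CI can be sketched as follows. This assumes the conventional Wald interval (coefficient ± 1.96 standard errors, exponentiated); the text does not state which interval was used, and the function name is hypothetical.

```python
import math


def odds_ratio_with_ci(beta: float, se: float, z: float = 1.96):
    """Convert a logistic regression coefficient and its standard error
    into (odds ratio, lower limit, upper limit) using a Wald interval."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


# Illustration: a coefficient of ln(2.36) reproduces the OR reported for
# male gender; the standard error of 0.2 is a made-up value for the example.
or_, lower, upper = odds_ratio_with_ci(math.log(2.36), 0.2)
print(f"OR {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the interval is symmetric on the log-odds scale, it is asymmetric around the OR itself.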
Our model-building strategy proceeded as follows. Explanatory variables (including those describing organ system failure; Table 1) were carefully defined using conventional published criteria (34). Variables collected at time of consultation were examined individually to evaluate whether there was a trend between the variable at issue and in-hospital mortality (P < 0.20). Variables that demonstrated a trend with mortality were placed into one of four main categories: demographic, historical, laboratory, or physiologic. These clusters prevented the inclusion of closely related variables (e.g., total bilirubin, liver failure) in the early model-building process. Variables that continued to show a trend with mortality within cluster (P < 0.10) were considered candidates for a larger multivariable model including variables from within all clusters. Missing variables were handled in the following manner. For demographic and historical variables, missing data were considered to be absent. Although imputation to “absent” would tend to lessen the strength of the association between explanatory variable and outcome, the effect of this assumption was tested by inclusion of a “missing” term in the multivariable regression model. In no case was the “missing” variable significant or nearly so. Similarly, where continuous variables were missing, the mean value was imputed and a “missing” term was included in the multivariable regression model. None of the final model covariates were missing in more than 5% of patients. Competing models were ranked by their −2 log likelihood χ2. Variables excluded during either the initial phase or after cluster analyses were re-entered individually to evaluate for residual confounding, defined as a change of ≥10% in one or more parameter estimates. Discrimination was assessed using the area under the receiver operating characteristic (ROC) curve (35). Calibration was assessed using the Hosmer-Lemeshow goodness of fit test (36). 
The Hosmer-Lemeshow test compares model performance (observed versus expected) across deciles of risk to test whether the model is biased (i.e., performs differentially at the extremes of risk). A nonsignificant value for the Hosmer-Lemeshow χ2 suggests an absence of such bias. Overall model performance was also assessed with the likelihood ratio. A likelihood ratio indicates the degree to which the pretest probability of an event is altered by information provided by a test, or in this case, a predictive model. The higher a model’s likelihood ratio, the greater the probability of accurately predicting events.
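The two performance measures described above can be illustrated with a short sketch in plain Python. The function names are hypothetical, and this is not the SAS procedure used for the analyses: the area under the ROC curve is computed as the proportion of death-survivor pairs in which the death has the higher predicted risk, and the Hosmer-Lemeshow statistic is computed over equal-sized groups ordered by predicted risk.

```python
def roc_auc(outcomes, risks):
    """Area under the ROC curve: probability that a randomly chosen death
    has a higher predicted risk than a randomly chosen survivor
    (ties count one half)."""
    deaths = [r for y, r in zip(outcomes, risks) if y == 1]
    survivors = [r for y, r in zip(outcomes, risks) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))


def hosmer_lemeshow_chi2(outcomes, risks, groups=10):
    """Hosmer-Lemeshow statistic: order patients by predicted risk, split
    them into equal-sized groups, and sum (observed - expected)^2 divided
    by n * mean_risk * (1 - mean_risk) over the groups."""
    pairs = sorted(zip(risks, outcomes), key=lambda p: p[0])
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(r for r, _ in chunk)   # expected deaths in the group
        observed = sum(y for _, y in chunk)   # observed deaths in the group
        mean_risk = expected / len(chunk)
        denom = len(chunk) * mean_risk * (1.0 - mean_risk)
        if denom > 0:
            chi2 += (observed - expected) ** 2 / denom
    return chi2
```

With deciles of risk (`groups=10`), the statistic is conventionally referred to a χ2 distribution with 8 degrees of freedom; a small, nonsignificant value indicates that observed and expected deaths agree across the range of risk.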
Table 1. Criteria for organ failure
The final model was validated using the bootstrapping technique (37,38). This procedure involves sampling (with replacement) and refitting models on 100 distinct 605-patient samples derived from the study population. Model parameter estimates, OR, and 95% CI were recalculated on the basis of the larger variation created by the bootstrapped samples. Ranges of mortality rates and areas under the ROC curve were also calculated.
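The resampling step can be sketched as below. This is a simplified illustration with hypothetical names: it draws 100 samples of the original size with replacement and computes a statistic on each. In the study the statistic of interest came from refitting the logistic model on every sample; here the in-hospital mortality rate is used so that the example stays self-contained, and the cohort is synthetic (314 deaths among 605 patients, matching the reported totals).

```python
import random


def bootstrap_statistic(records, statistic, n_boot=100, seed=0):
    """Draw n_boot samples of len(records) with replacement and return
    the statistic computed on each sample."""
    rng = random.Random(seed)
    n = len(records)
    results = []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]
        results.append(statistic(sample))
    return results


def mortality_rate(sample):
    """Proportion of records with died == 1; each record is (died, ...)."""
    return sum(rec[0] for rec in sample) / len(sample)


# Synthetic cohort: only the outcome is represented.
cohort = [(1,)] * 314 + [(0,)] * 291
rates = bootstrap_statistic(cohort, mortality_rate, n_boot=100, seed=0)
print(f"bootstrap mortality range: {min(rates):.3f} to {max(rates):.3f}")
```

Passing a function that refits the model and returns its coefficients or area under the ROC curve in place of `mortality_rate` would give the corresponding bootstrap ranges.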
P < 0.05 was considered statistically significant. All analyses were conducted using SAS Versions 7 and 8 (SAS Institute, Cary, NC).
Results
Patients
Of the 851 patients initially evaluated, 185 (21.7%) were randomized into a clinical trial (19 in pilot phase, 166 in study proper) comparing IHD and CRRT. Details of the study’s inclusion and exclusion criteria are available elsewhere (25). Two hundred and forty additional patients (28.2%) required some form of dialysis during their ICU stay; 147 (17.3%) received IHD, and 93 (10.9%) CRRT. Approximately half (50.1%) of patients did not undergo dialysis.
Patient characteristics are presented in Table 2, categorized into three groups: patients who required dialysis and received IHD as the primary dialytic modality, patients who required dialysis and received CRRT as the primary dialytic modality, and patients who did not require dialysis during the ICU or hospital stay. For this presentation, dynamic variables (e.g., hemodynamic parameters, laboratory studies) were determined on the day of nephrology consultation. After the predictive model was developed in the entire cohort, it was retested in prespecified subgroups to gauge its generalizability.
Table 2. Patient characteristics (day of nephrology consultation)
Correlates of Mortality
Three hundred and fourteen patients (51.9%) died in hospital. Among the demographic variables, age (OR, 1.002; 95% CI, 0.99 to 1.01), gender (OR, 1.40; 95% CI, 0.98 to 1.99), and race were not significantly associated with the odds of in-hospital death on bivariate analysis. Variables associated with the odds of in-hospital death are shown in Table 3.
Table 3. Bivariate correlates of mortality (day of consultation)
Multivariable Analyses
Logistic regression was chosen as the primary multivariable analytic method, as it requires fewer assumptions than proportional hazards regression and allows direct comparison of model discrimination and calibration with other generic and disease-specific predictive models. As noted above, closely related variables (e.g., thrombocytopenia and “hematologic failure,” hyperbilirubinemia and “liver failure”) were compared in nested logistic regression models to determine which variable (by χ2) was a better predictor of mortality. Later, the variables not selected within the nested analyses were re-examined in full multivariable models. Table 4 shows the nine variables included in the final logistic regression model. The logistic regression equation is listed below.
Table 4. Multivariable logistic regression model for mortality in ARF (day of consultation)
Note that age and gender were added to the model because both confound the relation between serum creatinine and renal function and because both may influence the risk for and manifestations of organ failure. When added to the model adjusting for organ failure, BUN, and creatinine, age and gender were significantly associated with the risk of death. The results using proportional hazards regression were similar to those using logistic regression, with similar hazard ratios (RR). In addition to the variables listed above, systolic BP (lower levels associated with increased mortality) was also a significant predictor in the proportional hazards (Cox) model.
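The general form of the logistic regression equation can be sketched from the odds ratios reported in the abstract, because each coefficient is the natural logarithm of its OR. The sketch below rests on several assumptions: the coefficients are rebuilt from rounded ORs and so will differ slightly from the fitted values, the intercept is not given in the text and must be supplied by the caller, the logarithm applied to urine output is assumed to be whatever base the model used (the caller passes the already transformed value), and all names are hypothetical.

```python
import math

# Odds ratios as reported in the abstract (rounded to two decimals).
ODDS_RATIOS = {
    "age_yr": 1.02,
    "male": 2.36,
    "respiratory_failure": 2.62,
    "liver_failure": 3.06,
    "hematologic_failure": 3.40,
    "creatinine_mg_dl": 0.71,
    "bun_mg_dl": 1.02,
    "log_urine_output_ml_d": 0.64,
    "heart_rate_bpm": 1.01,
}

# Each logistic coefficient is the natural log of its odds ratio.
COEFFICIENTS = {name: math.log(value) for name, value in ODDS_RATIOS.items()}


def predicted_mortality(patient: dict, intercept: float) -> float:
    """Predicted probability of in-hospital death from a logistic model.

    `intercept` is an assumption supplied by the caller; it is not reported
    in the text. `patient` maps every key of COEFFICIENTS to a value
    (0 or 1 for the binary indicators; urine output already log-transformed).
    """
    log_odds = intercept + sum(COEFFICIENTS[k] * patient[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

The signs of the coefficients mirror the reported associations: creatinine and log urine output have ORs below 1 and therefore negative coefficients, so higher values lower the predicted risk.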
Model Validation Using Bootstrapping
We ran 100 (605-patient) bootstrap samples on the data. The number of patients who died in hospital ranged from 283 (46.8%) to 357 (59.0%). The areas under the ROC curve ranged from 0.795 to 0.890. Table 5 shows the calculated OR and 95% CI for each of the variables in the original equation, all of which were validated by the bootstrap methodology. In other words, the model is not likely to be excessively overfit to the existing data set.
Table 5. Bootstrapped OR and 95% CI (100 samples) of existing model
Comparing Model Performance with Generic and other Disease-Specific Indices
To evaluate the new model against other models, including APACHE II, APACHE III, and other widely used generic severity of illness models, we compared the likelihood ratio and area under the ROC curve, which reflect explanatory and discriminative power, between the new model and seven generic models, and we used the Hosmer-Lemeshow goodness-of-fit test to check each model’s calibration (Table 6). We also compared the new model with five ARF-specific models previously applied to patients with ARF before the need for eventual dialysis was determined. Table 7 shows the new model’s performance within prespecified subgroups, including patients with acute versus acute on chronic renal failure, dialysis versus no dialysis, and timing of ARF relative to ICU admission. Finally, Table 8 shows the range, accuracy, and calibration of the model by deciles of risk, showing that the new model accurately predicts mortality risks ranging from <10% to >90%. Figure 1 shows the ROC curves for the new model and generic (panel A) and other disease-specific (panel B) models.
Table 6. Model performance (day of consultation)
Table 7. Performance of model in prespecified subgroups
Table 8. Observed versus predicted mortality across risk deciles
Figure 1. Receiver operating characteristic (ROC) curves for the Mehta model and other generic (A) and disease-specific (B) predictive models. The y-axis is sensitivity and the x-axis is 1-specificity.
Discussion
Risk adjustment in critically ill patients has proved essential to appropriate clinical management, quality improvement, and health resource utilization. Numerous comorbidity indices and severity of illness scores have been developed, many incorporating “all comers” to an ICU setting, such as the APACHE II and APACHE III scores, the Mortality Prediction Model (MPM), SAPS, and several others (26–28). Although these have proved invaluable to the critical care physician and other members of the healthcare team and have been used extensively for risk adjustment in several clinical trials in ARF (5,12–13,35,36), we and others have shown that these indices fall short of accurately predicting mortality in the subgroup of patients with ARF (16–19). Fiaccadori et al. (15) recently evaluated the performance of the APACHE II, SAPS II, and MPM II models in a single-center cohort of patients with ARF and demonstrated that two of the three generic scoring systems performed relatively poorly, with areas under the ROC curves of 0.75, 0.77, and 0.85 for the APACHE II, SAPS II, and MPM II models, respectively. One could conclude that the model derived here (Mehta model) performed better than the other models, as the likelihood ratio was highest and the 95% confidence limits of the area under the ROC curve showed no overlap with generic or disease-specific models (Table 6, column 4). The mediocre performance of generic models can be anticipated, because these models were developed from unselected ICU admissions, only a small fraction of whom had ARF. Moreover, indicators of renal function are among the key predictors in these scores, so that a generic index will typically underperform in a population uniformly affected with ARF. More recently, organ-scoring systems have been developed that use a more qualitative strategy to describe the underlying severity of illness (29–32).
Although these scores have been validated in patients with sepsis (39), they have not been previously evaluated specifically in patients with ARF. Although the organ-specific scoring systems performed somewhat better than the APACHE II, SAPS, and MPM models, they were not well calibrated and tended to overestimate mortality (40).
To address these limitations, several investigators have attempted to develop disease-specific (in this case, ARF-specific) indices to improve predictive power. Liaño et al. (2,10) have made major contributions in this area. The Acute Tubular Necrosis Severity Scoring Index (ATN-ISS) was derived and later validated in hospitals in Spain and elsewhere (10,23,41). The ATN-ISS is a linear discriminant model producing a percent likelihood of mortality on the basis of a variety of physiologic and laboratory parameters. Although useful in many investigators’ and clinicians’ hands, we have found the inclusion of the Glasgow Coma Scale score difficult to determine due to effects related to the use of sedative, analgesic, and paralytic agents. Moreover, this severity score has been limited to patients with ATN, who comprise most but not all patients with ARF in the ICU. Paganini et al. (21) developed and subsequently validated a model based on experience in more than 1000 ICU patients at the Cleveland Clinic Foundation (CCF). Although apparently robust, this model has been derived in the subset of ARF patients requiring dialysis, and hence is not applicable to “all comers” with ARF in the ICU. Indeed, the CCF model did not perform as well when applied to the entire population (Table 6) compared with the patients who required dialysis (Table 7). Among the important differences in a dialysis-only severity score and one applicable to patients at an earlier stage of ARF is the association between oliguria and mortality and the need for dialytic support. Indeed, in this study, urine output was among the most important variables. In the model developed by Paganini et al. (21), oliguria is associated with either no difference or a slight decrease in the risk of death relative to non-oliguria. To provide the clinician with the most valuable information, a model that accurately predicts outcome early in the course of ARF is most desirable. 
This quality would be essential if such a model were to be used for risk stratification in an early therapeutic intervention trial in ARF.
Of the model parameters identified in this study, advanced age and organ systems failure might be expected to be associated with mortality. It is unclear why male gender is associated with a two-fold increase in the risk of death with ARF, although Paganini et al. (21) and Chertow et al. (24) found a similar magnitude of excess risk among men. It is unlikely that it is related to the fact that creatinine generation is generally higher in men and that nephrology consultation may be more likely with higher serum creatinine (42). If this were the case, then the average severity of renal injury in men would be less than that in women at a given serum creatinine concentration; therefore, one might expect a higher risk of death in women with ARF. Although BUN and creatinine are both metabolic waste products that accumulate in ARF and generally track together, they exhibit opposite associations in this model, with higher BUN and lower creatinine concentrations being associated with an increased risk of death. Paganini et al. (21) noted a similar pattern in the CCF cohort. Higher BUN may be associated with increased protein catabolism, a subtle sign of metabolic stress. Low serum creatinine, particularly after adjustment for age and gender, probably reflects loss of muscle mass; however, it could also be related to volume overload or inflammation, whereas BUN may be affected by additional factors (e.g., gastrointestinal bleeding, nutritional supplementation, and corticosteroid use), potentially overcompensating for the volume-related effect.
There are several important limitations to this study. First, resources were insufficient to allow full collection of data required for calculation of severity scores in all 851 patients. However, there were no significant differences in demographic characteristics or comorbid conditions in patients who were not included in these analyses (data not shown), suggesting that no major bias was introduced. Second, we were limited by the frequency of measurement of some physiologic and laboratory variables that might be significantly associated with mortality. For instance, elevated pulmonary capillary wedge pressure might be associated with ARF, but it was not measured in the majority of patients. Indeed, the inclusion of infrequently observed yet potent risk factors (e.g., ventricular fibrillation) may result in improved discrimination of a model but also increases the risk of the model being overfit, rendering it less useful for broader use, as in a clinical trial. Third, data on body weight (as an estimate of volume status) were not always available. Volume status influences the concentration of BUN and creatinine (43) and may itself be a predictor of outcome (44); therefore, its absence likely reduced the predictive power of our model. Finally, by definition, this model must be somewhat overfit to the current data set. Although the performance of the model relative to generic and other ARF-specific models was superior, cross-validation of the model will be required to securely demonstrate its validity. Such an endeavor is underway using data collected prospectively from five academic medical centers that comprise the PICARD (Project to Improve Care in Acute Renal Disease) consortium.
In summary, using data collected from more than 600 patients with ARF in the ICU, we developed a regression model predicting in-hospital mortality using only readily available clinical data. Included among the predictor variables were advanced age, male gender, respiratory, liver, and hematologic failure (determined by specific criteria), diminished urine output, elevated BUN, diminished creatinine, and elevated heart rate. This model was superior to generic and other ARF-specific models. This and other models like it will be essential tools in the design and implementation of future clinical trials in critical care nephrology. Cross-validation of this model or its next generation will be required to accurately assess its performance relative to other risk adjustment tools.
Appendix
University of California San Diego Medical Center Hospitals (UCSDMC, a 450-bed tertiary care hospital with Level 1 Trauma facility; Thornton Hospital, a 120-bed tertiary care hospital); US Naval Medical Center (a 700-bed tertiary care hospital providing medical care to active duty personnel, their dependents, and retirees in the San Diego area); Veterans Administration Medical Center, San Diego, California (a 355-bed tertiary care center providing a broad spectrum of programs and services for veterans residing in San Diego, Imperial, and the southern portion of San Bernardino counties); University of California Irvine Medical Center, Irvine, California (a 400-bed tertiary care level 1 trauma facility serving Orange County and vicinities).
Acknowledgments
We thank the members of the Collaborative Group for Treatment of ARF in the ICU, composed of University of California San Diego Medical Center, San Diego, California; U.S. Naval Medical Center, San Diego, California; Veterans Affairs Medical Center, San Diego, California; and University of California Irvine, Irvine, California, for providing us with the data for analysis. We also thank the PICARD group for their thoughtful review of this manuscript. Members of the PICARD group include: Dr. Emil Paganini, Dr. Jonathan Himmelfarb, Dr. T. Alp Ikizler, Dr. Tom Greene, Stephanie Freedman, Susan Robertson, Michelle Garcia, Tracy Siefert, Cita Gruta, Karen Wallenfelsz, Tiffany Buchanan, and Rachel Manaster. This study was presented in abstract form at the 33rd Annual Meeting of the American Society of Nephrology, Toronto, Ontario, October 2000. The work was supported by the National Institutes of Health, National Institute of Diabetes, Digestive, and Kidney Diseases (NIH-NIDDK RO1-DK53412-0).
© 2002 American Society of Nephrology