Abstract
Lupus nephritis (LN) is one of the most prevalent and serious complications of SLE, with significant effects on patient and renal survival. Although a large number of genetic variants associated with SLE have been identified, biomarkers that correlate with LN are extremely limited. In this study, we performed a comprehensive sequencing analysis of the whole MHC region in 1331 patients with LN and 1296 healthy controls and validated the independent associations in another 950 patients with LN and 1000 controls. We discovered five independent risk variants for LN within the MHC region, including HLA-DRβ1 amino acid 11 (Pomnibus<0.001), HLA-DQβ1 amino acid 45 (P<0.001; odds ratio, 0.58; 95% confidence interval, 0.52 to 0.65), HLA-A amino acid 156 (Pomnibus<0.001), HLA-DPβ1 amino acid 76 (Pomnibus<0.001), and a missense variant in PRRC2A (rs114580964; P<0.001; odds ratio, 0.38; 95% confidence interval, 0.30 to 0.49) at genome-wide significance. These data implicate aberrant peptide presentation by MHC classes 1 and 2 molecules and sex hormone modulation in the development of LN.
SLE is a prototypical autoimmune disease characterized by autoantibody production, complement activation, and multiorgan damage.1 Lupus nephritis (LN) is one of the most serious complications of SLE, and the cumulative incidence of LN is higher in Asians and Africans (51%–55%) compared with whites (14%).2 Up to 25% of patients with LN will progress to ESRD within 10 years of presentation with renal disease. This has significant socioeconomic implications for patients, their families, and the health care system.3 Although an increasing number of genetic variants associated with SLE have been identified over the last decade, robust biomarkers predictive of LN do not currently exist. Identification of such predictive biomarkers will enable better evaluation of disease activity and facilitate personalized intervention to improve treatment outcomes.4
Previous genome-wide association studies (GWASs) in SLE have shown strong associations within the MHC region, particularly in the class 2 region.5–16 However, because of extensive linkage disequilibrium (LD) within the MHC region, it has proven challenging to identify the underlying causal variants of these associations within MHC. Two fine mapping studies in European populations identified multiple independently associated loci in the MHC region; however, there was little consistency in the loci identified across the two studies, although both studies did report DRB1*0301 as the most significantly associated HLA allele in SLE.14,15 More recently, another imputation-based study in a Korean population identified a significant association of amino acid position 13 of HLA-DRβ1 with SLE16; however, the effect of changes outside DRB1 remains unclear. A major limitation of these fine mapping studies is that only a small proportion of available single-nucleotide polymorphisms (SNPs) and HLA alleles was typed or imputed, thus strongly restricting the statistical power to localize true casual variants. In addition, the imputation of low-frequency SNPs and HLA alleles remains challenging, particularly for HLA class 2 genes, such as HLA-DRB1 and HLA-DQB1.17 High-coverage targeted sequencing of the MHC region enables interrogation of almost all SNPs and classic HLA alleles with high accuracy.18
To identify independent risk–conferring variants within the MHC region, we applied targeted sequencing of the whole MHC region in 1331 patients with LN and 1296 controls, with validation analysis in 950 patients with LN and 1000 controls in this study (Table 1), with the assumption that the susceptible variants are substantially enriched in this relatively homogenous subtype of SLE.
Summary of samples in targeted sequencing and validation studies
Results
Discovery and Validation Study
The extended MHC region (chr6: 28.6–33.6 Mb) was sequenced in 1331 patients with LN and 1296 controls through sequence capturing and next generation sequencing as described in Concise Methods. On average, the extended MHC region was sequenced to a mean coverage of 49×, with 93.98% covered by at least one read and 85.40% covered by at least eight reads (Supplemental Table 1). We identified a total of 99,047 SNPs, of which 14.14% were of low frequency (0.5%≤MAF<5%) and 58.07% were rare (MAF<0.5%) (Supplemental Figure 1). The classic alleles of 20 HLA genes were inferred directly from sequencing reads using a de novo assembly method established by the investigators in a previous study,18 and an overall call rate of 99.29% for HLA alleles at four-digit resolution was achieved (Supplemental Table 2). To assess the accuracy of HLA alleles inferred from the sequencing data, the classic alleles of five genes (HLA-A, -DRB1, -DQA1, -DQB1, and -DPB1) were typed by amplicon sequencing in a subset of 200 samples. Direct comparison of the two methods gave a concordance of 99.12% at four-digit resolution (Supplemental Table 3).
We performed a single-marker association test of 41,526 SNPs, 189 HLA classic alleles, and 1557 HLA amino acid variants (all with frequency ≥0.5%) with disease status. To account for population stratification, we separated our samples into two homogenous genetic clusters (Supplemental Figure 2), then performed association testing in two clusters separately, and finally, combined the association data. Consistent with the previous studies of SLE,14,15,19,20 we observed extensive associations across the MHC region, with the strongest associations located in the DRB1-DQA1-DQB1 locus (Figure 1, Supplemental Figure 3). The most significant associations were amino acid position 11 of HLA-DRβ1 (Pomnibus<0.001) and SNP rs79372730-G (P<0.001; odds ratio [OR], 2.29; 95% confidence interval [95% CI], 2.00 to 2.63). They were statistically equivalent and displayed strong LD (D′=0.99). To search for other independent associations across the MHC region, we carried out stepwise logistic regression including SNPs, HLA alleles, and amino acid variants in a unified framework. In addition to position 11 of HLA-DRβ1, rs114580964 (Padjusted<0.001), position 96 of HLA-DPβ1 (Padjusted<0.001), position 55 of HLA-DQβ1 (Padjusted<0.001), and position 156 of HLA-A (Padjusted<0.001) were identified as candidate independent associations for LN (Figure 1). In addition, we tested a number of alternative forward searching strategies, which confirmed that these five variants exhibited the strongest association in the MHC region with LN (Supplemental Table 4).
Five independent associations were identified after stepwise association analysis within the MHC region at the discovery stage. (A) The most significant association was mapped to amino acid position 11 in HLA-DRβ1 (Pomnibus<0.001). (B) After adjusting for position 11 in HLA-DRβ1, an independent SNP (rs114580964) within PRRC2A was identified. (C) Adjusting for these two variants identified position 96 in HLA-DPβ1. (D) After adjusting for position 11 in HLA-DRβ1, rs114580964, and position 96 in HLA-DPβ1, another independent signal position 55 in HLA-DQβ1 was found. (E) Adjusting for position 11 in HLA-DRβ1, rs114580964, position 96 in HLA-DPβ1, and position 55 in HLA-DQβ1, an independent signal at HLA-A (position 156) was identified. (F) There was no signal retained after adjusting for all of the loci above. The dashed horizontal line represents a P value threshold of <0.001 for suggestive significance.
To validate the five independent associations identified in the discovery stage and further determine the HLA alleles or amino acid positions that putatively drive the associations in the MHC region, we analyzed the five HLA genes (HLA-A, -DRB1, -DQA1, -DQB1, and -DPB1) and rs114580964 by amplicon sequencing in an independent cohort of 950 patients with LN and 1000 controls, in which the controls were matched to the patients in terms of sex, ethnicity, and age (Supplemental Tables 5 and 6, Table 2).
Association results for independent variants in discovery, validation, and combined studies
Two Amino Acid Positions Drive the Association in HLA-DRB1, -DQA1, and -DQB1
We identified 11 HLA classic alleles with associations that exceeded genome-wide significance (P<0.001) in the joint analysis of the discovery and validation samples (three in DRB1, three in DQA1, four in DQB1, and one in DPB1) (Supplemental Table 5). The separate associations of these 11 alleles in the discovery and validation samples were also highly significant (all of the P values are <0.001) (Supplemental Table 5 shows more details). Because of the tight LD among the classic alleles of DRB1, DQA1, and DQB1 (Supplemental Figure 4), we were not able to unambiguously assign causality to an individual allele. Intriguingly, we did not observe an association of DRB1*03:01 with LN in this study (P=0.05), although the association of DRB1*03:01 with SLE has been consistently reported in multiple European populations.14,15,19,20 This could not be attributed to a lack of statistical power, because our study had sufficient power (99.92%) to detect the association if the risk effect was comparable between SLE in the European population and LN in Chinese Han.
We further sought to evaluate whether the extensive associations of the HLA alleles identified might be driven by specific amino acid positions, particularly positions involved in antigen binding. At each amino acid position, the marginal significance of each amino acid and the overall significance of all residues were evaluated in the discovery and validation samples separately and then in combined analysis (Supplemental Table 6). Within the DRB1-DQA1-DQB1 locus, position 11 of HLA-DRβ1 (Pomnibus<0.001) was most significantly associated with LN followed by position 45 of HLA-DQβ1 (Padjusted<0.001) after adjusting for position 11 of HLA-DRβ1 (Figure 2).
Two independent amino acid positions were identified after stepwise association analysis for amino acids in HLA-DRβ1, HLA-DQα1 and HLA-DQβ1 in the combined analysis. (A) Amino acid position 11 in HLA-DRβ1 showed the strongest association with LN (P<0.001). (B) Adjusting for position 11 in HLA-DRβ1, position 45 in HLA-DQβ1 was significantly associated with LN (P<0.001). (C) After adjusting for position 11 in HLA-DRβ1 and position 45 in HLA-DQβ1, no amino acid was significant (P>0.001). (D) Distribution of deviance improvement for all combinations of two amino acid positions. The model including position 11 in HLA-DRβ1 and position 45 in HLA-DQβ1 is the best fit model for LN.
Of the six possible amino acids at position 11 of HLA-DRβ1, residue Pro11 was a strong risk allele for LN (P<0.001; OR, 2.14; 95% CI, 1.93 to 2.36) (Supplemental Figure 5, Table 2), whereas Ser11 (P<0.001; OR, 0.72; 95% CI, 0.66 to 0.78) and Val11 (P<0.001; OR, 0.64; 95% CI, 0.56 to 0.74) both conveyed a protective effect (Supplemental Figure 5, Supplemental Table 6). Positions 13, 96, 133, and 142 of HLA-DRβ1 also showed similar significance but were not independent of position 11 (D′>0.93; Padjusted≥0.02). The separate associations of the six amino acids at position 11 were highly consistent between the discovery and validation samples (Supplemental Figure 5). Positions 11 and 13 are located on the β-sheet floor within the antigen binding groove of the HLA-DR protein (Figure 3), whereas positions 96, 133, and 142 are outside (Supplemental Figure 6), suggesting that positions 11 and/or 13 are the likely causal variants underlying the association of DRB1.
Key amino acid positions identified by association analysis are highlighted in three-dimensional ribbon models for the HLA-DR, HLA-DQ, HLA-A and HLA-DP proteins. All protein structures are positioned to accommodate the view of the peptide binding groove and the associated amino acid residues using UCSF Chimera. (A) Three-dimensional ribbon model for the HLA-DR protein with positions 11 and 13 located on the β-sheet floor within the antigen binding groove. (B) Three-dimensional ribbon model for the HLA-DQ protein with position 45 in the β-turn of the HLA-DQ protein. (C) Three-dimensional ribbon model for the HLA-A protein with position 156 located in the β-sheet floor in the peptide-binding groove. (D) Three-dimensional ribbon model for the HLA-DP protein with position 76 in the β-sheet floor.
There are two amino acid residues (Glu and Gly) at position 45 of HLA-DQβ1, where Glu45 was protective for LN (P<0.001; OR, 0.58; 95% CI, 0.52 to 0.65) (Supplemental Figure 5, Supplemental Table 6, Table 2). Position 55 of HLA-DQβ1 showed similar significance with the adjustment of position 11 of HLA-DRβ1 (Padjusted<0.001) (Figure 2) but was almost equivalent to position 45 in terms of association, because they are in tight LD (R2=0.39; D′=1). The associations of amino acids at positions 45 and 55 in the discovery and validation samples were very consistent (Supplemental Figure 5). Position 45 is located in the β-turn of the HLA-DQ protein, which connects the α-helix and β-sheet, thereby maintaining the conformational structure of the antigen binding groove, whereas position 55 is located on the α-helix of the HLA-DQ protein (Figure 3).
After adjusting for these two amino acid positions, no other positions showed genome-wide significance in HLA-DRβ1, HLA-DQα1, and HLA-DQβ1 (Figure 2). In addition, we tested all possible combinations of two amino acid positions in these three genes and found that no combination had a better logistic regression model fit than the combination of position 11 of HLA-DRβ1 and position 45 of HLA-DQβ1 (Figure 2). Furthermore, the six haplotypes defined by these two amino acid polymorphisms nearly perfectly distinguished all of the identified risk and protective alleles (Table 3). These results imply that amino acid position 11/13 of HLA-DRβ1 and position 45/55 of HLA-DQβ1 drive the identified associations within the class 2 region.
Associations of haplotypes defined by two amino acids in position 11 of HLA-DRβ1 and position 45 of HLA-DQβ1
Independent Associations in HLA-A and HLA-DPB1
After adjusting for the effects of position 11 in HLA-DRβ1 and position 45 in HLA-DQβ1, the most significant association was mapped to position 156 in HLA-A (Pomnibus<0.001; Padjusted<0.001). There are four possible amino acid residues at this position. Leu156 was found to be protective for LN (P<0.001; OR, 0.78; 95% CI, 0.72 to 0.85), and Trp156 increased the risk of LN (P<0.001; OR, 1.33; 95% CI, 1.15 to 1.54) (Supplemental Figure 5, Supplemental Table 6). The independent association of HLA-DPB1 (from HLA-DRB1, -DQB1, and -A) was mapped to amino acid position 76 (Pomnibus<0.001; Padjusted<0.001) in the combined cohorts. There are three amino acid residues at this position, where Ile76 increased the risk of LN (P<0.001; OR, 1.51; 95% CI, 1.30 to 1.76), and Met76 was protective for LN (P<0.001; OR, 0.76; 95% CI, 0.68 to 0.85) (Supplemental Figure 5, Supplemental Table 6). Position 156 in HLA-A and position 76 in HLA-DPβ1 are both located in the β-sheet floor in the peptide binding groove of their corresponding proteins (Figure 3).
Independent Associations in PRRC2A (BAT2)
The association at rs114580964 (p.L1980I) in PRRC2A was also confirmed by the validation study (Pcombined<0.001; OR, 0.38; 95% CI, 0.30 to 0.49) (Table 2) and is independent of amino acid position 11 of HLA-DRβ1 and position 45 of HLA-DQβ1 (Padjusted<0.001). This variant is in high LD with multiple variants across the MHC region that exhibited similar significance in association tests (Figure 1), and some of these are within or near to known susceptible genes reported in previous SLE studies (e.g., HLA-B, HLA-C, MICB, NCR3, AIF1, CREBL1, TNXB, NOTCH4, and C6orf10).11,14,15,20–22 Although rs114580964-A is predicted to be a damaging coding variant by both SIFT and PolyPhen-2, we cannot conclude that this variant is pathogenic for LN without further direct functional analysis.
Comparison of Associations within the MHC Region between LN and SLE
In a previous GWAS, amino acid variants at positions 11, 13, and 26 of the HLA-DRB1 locus were reported as being responsible for the majority of SLE-MHC associations in a Korean population with SLE.16 We compared the associations of amino acid residues at these three positions in our LN study and the Korean SLE study and found that the associations were highly correlated16 (Pearson correlation coefficients were 0.77, 0.82, and 0.69 for positions 11, 13, and 26, respectively) (Supplemental Figure 7). However, we did not observe similar correlations at the same three positions when we compared our LN study with a study of SLE in a European population15 (Pearson correlation coefficients were 0.53, 0.47, and 0.23 for positions 11, 13, and 26, respectively). Nevertheless, we found that Pro11 (Pro11 and Arg13 are in perfect LD in our study) was the dominant risk residue for all three studies (Supplemental Figure 7), indicating that Pro11 could be a common causal variant for SLE in different populations.
Association Analyses of Clinical Phenotypes
We further investigated the associations of the five risk variants identified in this study with clinical phenotype in 967 patients with LN (Supplemental Tables 7 and 8). After Bonferroni correction (0.05/483=1.04×10−4), we found that Arg55 in HLA-DQβ1 (P<0.001; OR, 0.53; 95% CI, 0.41 to 0.68) and Pro11 in HLA-DRβ1 (P<0.001; OR, 0.54; 95% CI, 0.41 to 0.71) were associated with a lower frequency of positivity for serum anticardiolipin antibody IgM; Pro11 in HLA-DRβ1 was associated with a lower frequency of positivity for serum anticardiolipin antibody IgG (P<0.001; OR, 0.60; 95% CI, 0.47 to 0.77), and Leu55 in HLA-DQβ1 was associated with low serum complement component 4 (C4) level (P<0.001; OR, 1.90; 95% CI, 1.41 to 2.56). Serum anticardiolipin antibody IgM and IgG are both markers of increased risk of thrombotic events in patients with SLE,23 and low complement C4 level is associated with increased renal disease activity in patients with SLE.24
Discussion
This study is the first comprehensive sequencing analysis to identify causal variants for the extensive association signals previously reported across the MHC region in LN. We have identified five functional variants (amino acid position 11 in HLA-DRβ1, position 45 in HLA-DQβ1, position 156 in HLA-A, position 76 in HLA-DPβ1, and rs114580964 in PRRC2A) that independently modulate LN risk. Amino acid position 11 in HLA-DRβ1 and position 45 in HLA-DQβ1 explain 1.50% and 0.91% of the observed variance, respectively, and collectively, all five susceptibility variants explain 3.06% of variance in this study (Supplemental Table 9). Although position 11 in HLA-DRβ1 has been identified as a risk-conferring variant for SLE susceptibility in a previous study,16 our study is the first to show its significant association with development of antiphospholipid antibodies. In addition, our study is the first to identify associations of three additional amino acid positions in HLA genes and a missense variant in PRRC2A with LN.
HLA class 2 alleles have been repeatedly shown to display strong associations with SLE across the MHC region, particularly in the DRB1-DQA1-DQB1 locus.6,13–16,20 In this study, we show that associations with HLA-DRB1, HLA-DQA1, and HLA-DQB1 alleles are likely to be driven by amino acid position 11 in HLA-DRβ1 and position 45 in HLA-DQβ1. Compared with a previously published GWAS of SLE in a Korean cohort,16 in which only associations at amino acid positions 11, 13, and 26 of HLA-DRβ1 were identified, our study provides a complete picture of the independent effects of multiple loci within the DRB1-DQA1-DQB1 region, emphasizing the superiority of sequencing- over genotyping-based methodologies in fine mapping of the MHC region.
The four HLA amino acid variants identified in this study all participate in antigen binding, suggesting that conformational changes in HLA molecules, as a consequence of these variants, are likely to influence peptide presentation to T cells, thus modulating the risk of LN. The comparison of amino acid associations at position 11/13 of HLA-DRβ1 in studies of SLE in European15 and Korean populations16 suggests that Pro11/Arg13 is the common causal risk factor for SLE. This amino acid position, 11/13 in HLA-DRβ1, has also been associated with development of rheumatoid arthritis,25,26 type 1 diabetes,27 and follicular lymphoma28; however, the effect of these amino acid changes varies significantly across diseases. It is well recognized that DNA/nucleosomes and antidouble-stranded DNA antibodies are involved in the pathogenesis of LN,29 although the responsible peptide or core sequence has not yet been determined. Exploring the effect of the amino acid variants identified in this study on the crystal structure of the HLA molecules, the binding of LN-specific antigens by HLA and the T cell responses to antigens presented by HLA are likely to provide invaluable information about the molecular pathogenesis of LN. This information will hopefully explain why amino acid changes in the position 11 of HLA-DRβ1 have such a heterogeneous effect among different autoimmune diseases.
The rs114580964 encodes a missense variant in PRRC2A, leading to a change of leucine to isoleucine at position 1980 (p.L1980I) that may affect protein properties. PRRC2A is highly expressed in the hypothalamus,30 which is involved with regulation of most endocrine organs, and mainly affects hypothalamic-pituitary-adrenal and hypothalamic-pituitary-gonadal axes.31 In addition, this gene has been reported to be associated with the age at menopause,32 ER- and PR-positive breast cancer,33 and type 1 diabetes mellitus,34 indicating a role for PRRC2A in sex hormone regulation and immune inflammation. A predominance in women and a low risk of SLE development in prepubertal or postmenopausal women also suggest a role for estrogen in disease predisposition. However, because of the complex LD around this gene, further functional studies are necessary to uncover the potential role of PRRC2A in development of LN.
In conclusion, sequence analysis of the whole MHC region in a large LN cohort has led to the identification of five independent risk-conferring variants in the MHC region, which significantly advance our understanding of the genetic basis of LN. These risk variants support roles for aberrant peptide presentation by MHC classes 1 and 2 molecules and sex hormone dysregulation in the pathogenesis of LN.
Concise Methods
Subjects
The targeted sequence analysis involved 1331 patients (936 patients from southern China and 395 patients from northern China) and 1296 controls (879 controls from southern China and 417 controls from northern China). For the validation study, two independent patient-control samples were recruited from southern (664 patients and 701 controls) and northern (286 patients and 299 controls) China.
Over 48% of the patients with LN were recruited at The First Affiliated Hospital, Sun Yat-sen University (521 patients), the Peking University First Hospital, Peking University (333 patients), the RuiJin Hospital, Shanghai Jiao Tong University (150 patients), and the West China School of Medicine/West China Hospital, Sichuan University (96 patients), the largest hospitals in southern, northern, eastern, and midwest China. The other patients were referred from hospitals in other provinces. Controls were recruited from the same geographic region. All participants were self-reported as Han Chinese.
All patients with LN met the revised criteria for the classification of SLE from the American College of Rheumatology,35 and they were biopsy proven and histopathologically classified according to the International Society of Nephrology (ISN) and the Renal Pathology Society (RPS) classification of LN in 2003.36 In accordance with the ISN/RPS classification of LN, our samples were classified into six pathologic patterns. Clinical information was collected from 967 patients in the discovery stage and 606 patients in the validation stage. Information collected at diagnosis included ISN/RPS classification of LN, CKD stage, rash, oral ulcers, arthritis, serositis, neurologic disorder, hematologic disorder, antinuclear antibody, antidouble-stranded DNA antibody, anti-Smith antibody, anticardiolipin antibodies IgM, anticardiolipin antibodies IgG, SLE disease activity index score, proteinuria (24 hour), IgA, IgM, IgG, complement component 3, C4, and GFR. All healthy controls had no history of renal disease as well as a negative urinalysis (absence of red blood cells and protein in urine) and normal serum creatinine levels. Sex and age information were collected from both patients and controls through questionnaires. Patients and controls were matched for geographical origin. The baseline characteristics of the subjects are summarized in Table 1.
The study was approved by the Institutional Review Board at The First Affiliated Hospital of Sun Yat-sen University and conducted according to the principles of the Declaration of Helsinki; informed consent was collected from each participant in this study.
Targeted Sequencing of MHC and Immune-Related Genes
MHC capture technology developed by BGI and Roche NimbleGen was applied in the discovery stage to perform deep and near-complete sequencing of the MHC region.37 Approximately 4.97 Mb of the genomic region was targeted by capture probes, including 3.37 Mb of the core MHC region and 1.6 Mb of the extended region. A customized capture kit to enrich for coding sequences of approximately 700 immune disease–related genes from the NHGRI GWAS Catalog, including 38 immune diseases, such as SLE, rheumatoid arthritis, psoriasis, and systemic sclerosis, was also used (data not shown); 1 μg genomic DNA was fragmented and hybridized to the capture probes following the manufacturer’s protocols (Roche NimbleGen). The resulting libraries were sequenced on the Illumina HiSEquation 2000 platform to generate paired end reads of 90 bp.
Sample Filtering
We performed stringent QC of sequencing data at an individual level; 40 samples were not well sequenced in the discovery stage (Supplemental Table 10). We also checked for inbreeding and cryptic relatedness in the discovery samples using Plink (Supplemental Figure 8). Principal component analysis was performed using smartpca,38 and outlier samples for the first ten principal components were removed. In total, we removed 122 samples. The remaining 2505 were used in association analysis (Supplemental Table 10).
Variant Calling and Filtering in the Discovery Stage
We used a multisample approach to detect SNPs from sequencing data.39 We generated the genotype likelihood files using samtools (version 0.1.8). For each position, the maximum likelihood of data assuming that it was an SNP and the likelihood of data under a null monomorphic model were estimated from genotype likelihoods. Then, a log likelihood statistic was calculated. We defined a log likelihood statistic cutoff of 20 and obtained a primary set of 145,350 SNPs. For all of these primary SNPs, the individual genotypes were imputed from sequencing data using BEAGLE with default parameters. The imputed genotypes are more accurate than genotypes called directly from sequencing data, especially for SNPs with low sequencing depth, because the LD information of neighboring variants was incorporated into the imputation model. To obtain a set of SNPs with high-confidence genotype calls, a series of quality filters including sequencing depth, mapability score, strand bias, allelic balance, length of homopolymer run, base quality, and Hardy–Weinberg test, was applied to the primary SNPs (Supplemental Table 11). After stringent filtering, 99,047 MHC SNPs were retained for association analysis.
HLA Typing in the Discovery Stage
We adopted a validated method18 on the basis of local de novo assembly using bam files as input to type 20 genes (including six HLA genes in MHC class 1 region, ten HLA genes in MHC class 2 region, and four non-HLA genes) recorded in the IMGT/HLA database. Finally, a total of 392 four-digit or higher resolution alleles, 757 amino acid positions, and 1723 amino acid variants were obtained.
SNP Genotyping and HLA Allele Typing in the Validation Stage
Nine SNPs and five HLA genes were selected for replication. We performed amplicon sequencing to determine the genotype of nine SNPs and classic alleles of DQA1 and DPB1, combining WaferGen Smart chip with Illumina Miseq. The primers were designed with Primer3 and checked by Primer-BLAST. For the two HLA genes, seven and nine pairs of primers were designed to amplify the first four exons of DQA1 and DPB1, respectively. Paired end reads produced by MisEquation (250 bp per end) were filtered and aligned to the human reference genome (hg18) using BWA mem40 with the default parameters. The SNP genotypes were called by GATK. For each SNP, a scatter plot of read number and the fraction of reads from the alternative allele of each individual were drawn to check the quality of genotyping manually. In total, eight SNPs were successfully genotyped in the validation stage. The consensus sequence and the classic alleles of DQA1 and DPB1 were called for each individual, because each exon was sequenced in its full length. The allele types of HLA-DRB1, HLA-DQB1, and HLA-A were determined using a validated method.18
Association Analyses of SNPs, Amino Acids, and HLA Alleles
An additive logistic regression model was applied to evaluate the association of a specific allele at a biallelic or multiallelic locus. Explicitly, for a biallelic locus, we used one variable coding the dosage of the minor allele in the logistic model to assess its effect; for a multiallelic locus, we collapsed it as a series of biallelic loci and then tested the association of each allele separately. To evaluate the overall association of a multiallelic locus of k alleles, the most frequent allele in the controls was used as the reference allele, and k−1 variables coding the dosage of the k−1 alternative alleles were included in the logistic regression model (e.g., variable xi codes two if an individual carries two copies of the alternative allele i). The improvement of the model fit (defined as −2× the log likelihood) over the null model that followed a chi-squared distribution with k−1 degrees of freedom was calculated to assess the overall P value of the locus.
In the discovery stage, we used the top five principal components to account for population stratification in the association analysis, which achieved a genomic inflation factor of 1.1. Then, we separated our samples into homogenous genetic clusters by K-means clustering using 10,096 common and independent non-MHC SNPs (data not shown) on the basis of the top 20 principal components by R package (vegan). At K=2, the genetic clustering was largely consistent with geographic location of Chinese Han (Supplemental Figure 2), with a resulting inflation factor of 1.037. Finally, we combined the effects of two clusters on the basis of the inverse variance model using meta-analysis. Logistic regression modeling–based allele-dosing (additive) effects were used to assess the allelic risk of each cluster in the MHC region.
In the validation stage, we performed association tests in samples from northern and southern China separately and then combined the results as we did in the discovery stage. The heterogeneity of the association between the southern and northern samples as well as the discovery and validation studies was measured using the statistics I2 and Cochran Q.
Stepwise Association Analyses and Variance Explained for Candidate Variants
To test whether the association of a locus B with k alleles is independent of locus A with m alleles, we assessed the improvement of the fit of the logistic regression model with k−1 variables coding the dosage of the alternative alleles of locus B and m–1 variables for locus A versus the model where only m–1 variables coding the dosage of the alternative alleles of locus A were included. The resulting deviance approximates chi-squared distribution with k−1 degrees of freedom, and thus, the adjusted P value was evaluated.
A method proposed by So et al.41 was used to estimate variance explained in our study. This method converts the OR to relative risk and estimates the variance explained by a single-nucleotide variation under a liability threshold model. The prevalence of LN was set to 0.028% according to a previous study42 in the calculation.
Disclosures
None.
Acknowledgments
We thank all of the subjects and healthy volunteers who participated in this work. We thank the staff of The First Affiliated Hospital of Sun Yat-sen University for help with the sample preparation as well as the team at BGI-Shenzhen for help with the sample genotyping. We also thank J. Flint for reviewing the manuscript and providing constructive suggestions as well as Dr. Jonathan Barratt and Dr. Cijiang He for re-editing of the revised manuscript.
This study was funded by Special Project on the Integration of Industry, Education and Research of Guangdong Province grant 2010A090200075; National Basic Research Program of China 973 program grants 2011CB809201, 2011CB809202, and 2011CB809203; Guangdong Department of Science and Technology Translational Medicine Center grant 2011A080300002; Specialized Research Fund for the Doctoral Program of Higher Education of China grant 20130171130008; Science and Technology Planning Project of Guangdong Province, China grant 2013B051000019; and Guangzhou Committee of Science and Technology, China grant 2012J5100031.
Footnotes
R.X., Q. Li, R.L., and J.S. contributed equally to this work. Yingrui Li and X.Y. contributed equally to this work.
Published online ahead of print. Publication date available at www.jasn.org.
This article contains supplemental material online at http://jasn.asnjournals.org/lookup/suppl/doi:10.1681/ASN.2016121331/-/DCSupplemental.
- Copyright © 2017 by the American Society of Nephrology