In Silico Biology 4, 0035 (2004); ©2004, Bioinformation Systems e.V.  

Selecting SNPs for association studies based on population frequencies: A novel interactive tool and its application to polygenic diseases

Steffen Möller1, *, Dirk Koczan1, Pablo Serrano-Fernandez1, Uwe K. Zettl2, Hans-Jürgen Thiesen1 and Saleh M. Ibrahim1

1 University of Rostock, Institute of Immunology, Schillingallee 70, 18055 Rostock, Germany
2 University of Rostock, Department of Neurology, Gehlsheimer Str. 20, 18147 Rostock, Germany

*  Corresponding author

Edited by E. Wingender; received April 02, 2004; revised June 25, 2004; accepted June 25, 2004; published July 26, 2004


Common complex polygenic diseases such as autoimmune diseases are not yet completely understood at the molecular level. While many genes are known to be involved in the pathways responsible for the phenotype, explicit causes for the susceptibility to the disease remain to be elucidated. The susceptibility to disease is thought to be the result of genetic epistatic interactions between common polymorphic genes. This polymorphism is mostly caused by single nucleotide polymorphisms (SNPs). Human subpopulations are known to differ in their susceptibility to these diseases and, more generally, in the distribution of single nucleotide polymorphisms.

The approach presented here retrieves the SNPs with the most divergent frequencies for selected human subpopulations, to help define priorities for the experimental verification of SNPs within defined regions. A web-accessible program implementing this approach was evaluated for multiple sclerosis (MS), a common human polygenic disease. A link to a summary of data from "The SNP Consortium" (TSC) with sex-dependencies of SNPs is available. Associations of SNPs to genes, genetic markers and chromosomal loci are retrieved from the Ensembl project. This tool is recommended for use in conjunction with microarray analyses or marker association studies that link genes or chromosomal loci to particular diseases.


Key words: SNP selection by population frequencies, association studies, Multiple Sclerosis, Ensembl


Quantitative trait loci (QTLs) in animal models and human susceptibility loci of genetic linkage association studies point to regions of the genome that are statistically linked to a disease [Complex Trait Consortium, 2003]. Microarray analyses further indicate candidate genes within QTLs [Dyment and Ebers, 2002; Lock, 2002; Robinson, 2003] by selecting on their expression profile, i. e. their distribution in tissues and/or differential expression between diseased and control groups.

Besides environmental factors or epigenetic effects, specific single nucleotide polymorphisms (SNPs) may singularly be responsible for a disease (Mendelian factor) or increase the susceptibility to it (polygenic factors). Genes harboring SNPs that are found unevenly distributed between control and disease groups may be considered as candidate genes [Foster and Sharp, 2002]. Associated genes hence would be subject to further investigations which may lead to an improved understanding of the pathogenesis of a certain disease and eventually to its therapy.

Investigating the distribution of SNPs within and between human populations is laborious and costly, mainly because large numbers of individuals and SNPs must be tested. Establishing priorities in the selection of SNPs on the basis of additional experimental or epidemiological knowledge is therefore essential for speeding up the development of novel drugs. A SNP may not itself be involved in conferring susceptibility to a disease; instead, it may be located within a region of linkage disequilibrium associated with the disease. Hence, the differential distribution of that SNP in subpopulations could be transferable to otherwise uncharacterized neighboring SNPs. The HapMap project [The International HapMap Consortium, 2004], for instance, addresses the challenge of genotyping individuals from different populations on a large scale. The data on population frequencies and haplotypes are available at the dbSNP database of the NCBI [Sherry et al., 2001; Wheeler et al., 2004]. Public SNP databases, i. e. dbSNP and HGVbase (Human Genome Variation base) [Fredman et al., 2004], are rapidly expanding. Besides HapMap, efforts like those of the UW-FHCRC Variation Discovery Resource (SeattleSNPs) and The SNP Consortium (TSC) [Matise, 2003; Thorisson, 2003] contribute information on allele distribution in different populations, based either on race or geographical locations.

For polygenic diseases, such information on SNP population frequency may be used as an additional criterion to prioritize the SNPs and genes to be experimentally tested, whenever the disease affects the different genders and/or populations unevenly. This has been suggested for Rheumatoid Arthritis [Steer, 2003], infectious diseases [Abel and Dessein 1998], post-traumatic stress responses [Zoellner et al., 1999] and Alzheimer's Disease [Arehart-Treichel, 2001], amongst others. Here we lay out the principle using the example of multiple sclerosis (MS), for which epidemiological studies have shown Asians and (with conflicting reports) Africans to be considerably less susceptible to MS than Caucasians [Wallin, 2004; Wheeler, 2004].

Genetic linkage analysis in humans yielded multiple loci that are statistically linked with susceptibility to MS. The tool presented here (SNPselect, for SNP selection on the basis of the Ensembl databases) filters within these loci those SNPs for which the frequency distribution in subpopulations has been determined. The derived ratios between frequencies of multiple populations are offered as a criterion for an automated selection, which is an improvement over previous tools for selecting SNPs by population frequencies, such as the Ensembl-mart query environment [Kasprzyk et al., 2004], a recent development by Nguyen et al., 2004, or SNP3D [Peng You et al., unpublished]. Even the search tools of the TSC or the HapMap project do not yet provide this service.

The tool is applied to the chromosomal loci associated with MS according to genetic linkage analyses. The SNPs in these regions with the most different frequencies between Caucasian (European and North American) and Asian or African populations are listed. The results of this tool when applied to the peptidylarginine deiminase gene cluster and to CD24 are compared with data from the literature.


Human MS susceptibility loci

Human MS susceptibility loci and genes associated to MS were assembled both from the LocusLink database [Wheeler et al., 2004] and from data of recent large scale studies on MS [Akesson et al., 2002; Dai et al., 2001; Dyment et al., 2001; Goedde et al., 2002; Saarela et al., 2002; Sawcer et al., 2002]. Tables 1 and 2 show the list of genes and markers derived from these studies that were analyzed with the SNPselect application. The boundaries of the human susceptibility regions were defined by the flanking markers described in the original publications or by 0.5 Mbp on either side of the disease-associated marker, where flanking markers were not available. For loci derived from polymorphisms within genes, the assignment of SNPs to genes provided by Ensembl was used.

Table 1: Multiple Sclerosis Susceptibility Loci - Genes.
Chr Genes
3 TGFBR2, CCR5, CD80, IL12A
5 MAP1B, IRF1, IL13, IL4, IL9, FGF1, IL12B
15 NTRK3, Q8NDG8
16 IL17C
18 MBP
20 MMP9
The first column lists the chromosome, the second the genes used as markers that were found associated to Multiple Sclerosis. These gene names and external IDs were used to select SNPs in Ensembl.

Table 2: Multiple Sclerosis Susceptibility Loci - Markers.
D1S498, D1S1590, D10S249, D10S1653, D12S326, D14S605, D16S3075, D17S1882, D17S928, D18S52, D19S585, D2S2739, D2S2330, D2S1395, D2S364, D2S1345, D21S1270, D22S423, D22S535, D22S280, D3S1304, D3S1278, D4S1592, D4S416, D6S2444, D4S2394, D4S2378, D6S1615, D6S1014, D6S2447, D6S434, D7S1809, D9S164, D9S1826, D9S158, DXS1060, DXS8051, DXS101
These markers were found associated to Multiple Sclerosis and were used to select SNPs in Ensembl. The marker position was extended by 500 kbp to either side for the selection.
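The extension of a marker position by 500 kbp to either side can be sketched as follows. This is a minimal illustration, not the code of SNPselect; the names `Locus`, `marker_window` and `snp_in_locus` as well as the marker position used are hypothetical.

```python
from typing import NamedTuple

FLANK_BP = 500_000  # 0.5 Mbp on either side of the marker, as stated in the text


class Locus(NamedTuple):
    chromosome: str
    start: int
    end: int


def marker_window(chromosome: str, position: int, flank: int = FLANK_BP) -> Locus:
    """Return the region around a marker, clipped at the chromosome start."""
    return Locus(chromosome, max(1, position - flank), position + flank)


def snp_in_locus(snp_chromosome: str, snp_position: int, locus: Locus) -> bool:
    """True if the SNP lies on the same chromosome and inside the region."""
    return (snp_chromosome == locus.chromosome
            and locus.start <= snp_position <= locus.end)


# Hypothetical marker position, for illustration only.
window = marker_window("6", 32_000_000)
```

Where flanking markers are described in the original publication, the region would instead be bounded by their positions, which this sketch does not cover.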

SNP databases

SNPs of the dbSNP and HGVbase databases are retrieved via the Ensembl databases version 19 [Kasprzyk et al., 2004]. These are annotated with their chromosomal location and linked to the genes presented by Ensembl, so that, e. g., SNPs that are non-synonymous in at least one transcript can be selected. The database of the TSC was installed locally, so that any dbSNP entry with an associated TSC accession number is linked to a more detailed presentation of population frequencies that includes sex-dependencies for entries whose genotyping data are available.

SNP selection

All SNPs are retrieved for which a difference was determined between the frequencies in Americans or Europeans on the one side (as the more susceptible populations with respect to MS) and Asians or Africans on the other side (less susceptible populations). An additional population of multinationals is available for some SNPs with a particularly strong heterogeneity. For the analysis, loci derived from polymorphisms in genes are treated separately from those derived from genetic markers. SNPs with the highest frequency ratios are selected for further analysis. For additional verification, the selected SNPs may be linked to a further source of SNPs and their frequencies, provided by Applied Biosystems through its SNP genotyping selector. It is based on the same cell cultures as the TSC data and offers both frequencies of SNPs in races and protocols for TaqMan-based analysis.
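The selection step may be sketched as below, under the assumption that each SNP carries the frequency of one fixed allele per population. The population labels, the threshold default and the function names are illustrative and do not reflect the schema of the Ensembl mart tables.

```python
from itertools import product

# Population groups as described in the text; the labels are assumptions.
SUSCEPTIBLE = {"European", "North American"}
LESS_SUSCEPTIBLE = {"Asian", "African"}


def max_group_ratio(freqs):
    """Largest symmetric ratio between any susceptible and any less
    susceptible population; None if one of the groups has no usable data."""
    high = [f for pop, f in freqs.items() if pop in SUSCEPTIBLE]
    low = [f for pop, f in freqs.items() if pop in LESS_SUSCEPTIBLE]
    # Zero frequencies are skipped here to avoid division by zero.
    ratios = [max(a / b, b / a) for a, b in product(high, low) if a > 0 and b > 0]
    return max(ratios) if ratios else None


def rank_snps(table, threshold=2.0):
    """Keep SNPs whose ratio reaches the threshold, highest ratio first."""
    scored = ((snp, max_group_ratio(freqs)) for snp, freqs in table.items())
    kept = [(snp, r) for snp, r in scored if r is not None and r >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical frequencies of one fixed allele per SNP.
example = {
    "snpA": {"European": 0.40, "Asian": 0.10},
    "snpB": {"European": 0.30, "African": 0.25},
    "snpC": {"European": 0.20},  # no comparison population available
    "snpD": {"North American": 0.15, "Asian": 0.45, "African": 0.20},
}
ranked = rank_snps(example)
```

In this example only `snpA` and `snpD` pass a threshold of 2; `snpC` is dropped because no frequency from a less susceptible population is available.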

Web interface

The application SNPselect was developed for the selection of SNPs by frequency values and by the ratio of those frequencies between populations. The application queries the Ensembl mart database for the chromosomal location of each SNP and for the Ensembl genome annotation, which gives the position of the SNP within genes. The web interface directly reflects the schema of the Ensembl mart SNP tables. The data workflow is summarized in Figure 1.

Figure 1: Data workflow of the SNP search algorithm. Search constraints are introduced by the user in the web interface (SSI). The algorithm queries the databases, which are linked with each other, considering the constraints and calculates the corresponding SNP frequency ratios. Results are integrated and displayed as output of the interface.

For the calculation of the frequency ratios, it was ensured that the frequencies of the same allele are compared. Otherwise, when the minor allele of one population is the major allele of another and only the respective minor frequencies are used, erroneous ratios would be calculated. For the figures, the maximum of the ratio and its reciprocal value is used, with 1 being the minimal ratio (equal frequencies). Hence, no distinction is made between putative susceptibility alleles and protective alleles.
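The pitfall of comparing minor-allele frequencies can be illustrated with a short sketch. The allele frequencies are hypothetical and the function names are assumptions made for this example only.

```python
def symmetric_ratio(f1, f2):
    """Maximum of the ratio and its reciprocal; 1.0 means equal frequencies."""
    return max(f1 / f2, f2 / f1)


def naive_minor_ratio(pop1, pop2):
    """Compare each population's own minor allele (the erroneous variant)."""
    return symmetric_ratio(min(pop1.values()), min(pop2.values()))


def per_allele_ratios(pop1, pop2):
    """Compare the frequencies of the same allele in both populations."""
    return {
        allele: symmetric_ratio(pop1[allele], pop2[allele])
        for allele in pop1
        if allele in pop2 and pop1[allele] > 0 and pop2[allele] > 0
    }


# Hypothetical frequencies: T is the minor allele in pop_x, C in pop_y.
pop_x = {"C": 0.6, "T": 0.4}
pop_y = {"C": 0.3, "T": 0.7}

naive = naive_minor_ratio(pop_x, pop_y)    # compares 0.4 (T) with 0.3 (C)
aligned = per_allele_ratios(pop_x, pop_y)  # compares C with C and T with T
```

Here the naive comparison yields a ratio of about 1.33, whereas the comparison per allele yields 2.0 for C and 1.75 for T, so the difference between the populations is underestimated when the minor alleles do not coincide.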


SNPs retrieved from MS susceptibility loci

Ratios of the SNP frequencies between the susceptible and the less susceptible populations are calculated. With the multinational population of maximum heterogeneity assigned to the group of less susceptible populations, 474 SNPs were found in loci associated to those genes, whose reported polymorphisms were used as a genetic marker in the MS association studies. Of these, 234 (49.4%) show a maximum ratio (or a maximum reciprocal of the ratio) higher than 2. The distribution is shown in Figure 2a. Ignoring the data for the multinational population, the number is reduced to 122 of 261 (46.7%). For the genetic marker-based loci, 121 of 319 SNPs (37.9%) could be selected (Figure 2b), 80 of 211 (37.9%) when disregarding the multinationals.

Figure 2: Distribution of frequency ratios between European or American and African or Asian populations derived from genes (A) and genetic markers (B).
The x-axis displays the maximum of the ratio or the reciprocal of the ratio of the frequency of the minor allele of any two populations for a given SNP, the y-axis the relative number of occurrences within the multiple sclerosis susceptibility loci. Demanding a ratio of 2 or higher halves the number of putative susceptibility or protective alleles.

The golden path of Ensembl 19 (NCBI assembly 34) has a total length of 2,841 Mbp and 23,531 Ensembl genes are predicted, with an average distance of 121 kbp between genes. Hence, the area investigated around genetic markers is more than 4 times larger than for polymorphic genes, though the marker-based regions yielded only 2/3 of the SNPs that could be selected by genes. This indirectly explains the difference in the effect of the ratio as a selection criterion. With more populations being studied for a single marker, the chance rises of finding a subpopulation that differs in frequency. The effect of selecting on frequency ratios is much stronger for marker-derived SNPs than for those derived from polymorphic genes (Figures 2a and b). In contrast, the inclusion of the multinational subpopulation has no effect on either distribution. For the case of MS susceptibility loci presented here, only a single SNP (rs758767) was derived both by the marker-based and the gene-based selection, as shown in Tables 3 and 4.
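The figures in this paragraph can be retraced with a few lines of arithmetic. The sketch assumes that the region around a polymorphic gene spans one average gene distance to either side, which is an interpretation and not stated explicitly at this point.

```python
GENOME_MBP = 2841        # length of the golden path, Ensembl 19
ENSEMBL_GENES = 23531    # predicted Ensembl genes

# Average distance between genes in kbp (about 121).
avg_gene_distance_kbp = GENOME_MBP * 1000 / ENSEMBL_GENES

# Region investigated per marker: 500 kbp to either side.
marker_window_kbp = 2 * 500

# Assumption: one average gene distance to either side of a gene.
gene_window_kbp = 2 * avg_gene_distance_kbp

# How much larger the marker-based region is (a little above 4).
window_ratio = marker_window_kbp / gene_window_kbp

# Marker-based SNPs relative to gene-based SNPs (319 of 474, about 2/3).
snp_yield_ratio = 319 / 474
```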

Table 3: SNPs with 2fold difference in frequencies derived from markers.
rs1007541, rs10082, rs1009473, rs1017361, rs1076150, rs1108581, rs1157595, rs1160844, rs1196455, rs1308156, rs1317685, rs134132, rs1344773, rs1365806, rs1367896, rs1370808, rs1370809, rs137997, rs138002, rs1410782, rs1417663, rs1437622, rs1451667, rs1473360, rs1476129, rs1545003, rs1611125, rs1770547, rs184003, rs1887519, rs2021678, rs204994, rs206019, rs206777, rs2072632, rs2072634, rs2270370, rs2629085, rs2734331, rs283368, rs283369, rs283382, rs33623, rs33636, rs345203, rs397081, rs423023, rs47808, rs487561, rs487820, rs500461, rs553877, rs561044, rs606141, rs608292, rs638778, rs695982, rs697449, rs711079, rs713938, rs714743, rs719499, rs730647, rs731653, rs733071, rs737779, rs739398, rs739560, rs744991, rs750759, rs756948, rs758767, rs77905, rs8283, rs882566, rs901375, rs915237, rs919643, rs979000, rs999200

Table 4: SNPs with 2fold difference in frequencies derived from genes.
rs1003692, rs1003723, rs1006396, rs1008515, rs1009148, rs1010167, rs1013948, rs1028580, rs1042031, rs1155708, rs1233391, rs1233397, rs1323658, rs1326282, rs1433099, rs1491709, rs1495101, rs1514347, rs1518110, rs1548554, rs1560975, rs1590, rs1599796, rs1649204, rs1713223, rs1794068, rs1799724, rs1800610, rs1800629, rs1800795, rs1800796, rs1800797, rs1800871, rs1861494, rs1874791, rs2038931, rs20541, rs2069718, rs2069762, rs2069763, rs2069879, rs2069882, rs2071459, rs2072592, rs2201584, rs2235330, rs2239704, rs2243136, rs2243250, rs2243263, rs2250889, rs226376, rs226379, rs2281089, rs2288831, rs228937, rs228942, rs231775, rs281437, rs281440, rs3021094, rs31564, rs315952, rs3211607, rs3212227, rs3213448, rs33998, rs344548, rs360722, rs375947, rs377690, rs3778082, rs430507, rs432001, rs432823, rs446037, rs454078, rs470279, rs470615, rs470907, rs470929, rs475825, rs4891, rs512535, rs523243, rs532117, rs5494, rs549908, rs583911, rs625456, rs673548, rs706781, rs709932, rs720541, rs729714, rs741780, rs744751, rs745993, rs746389, rs746868, rs758767, rs760720, rs765060, rs768170, rs795467, rs84182, rs84459, rs867234, rs869411, rs875989, rs891595, rs913059, rs926103, rs928940, rs932477, rs934062, rs947889, rs947890, rs9509, rs982764, rs988328, rs995185

Ensembl genes with SNPs

All of the 319 marker-based SNPs had at least one gene assigned, with 85 SNPs having 2 or more genes assigned. No SNP has more than 4 genes assigned. One should recall that a single SNP may appear both downstream of one gene and upstream of another. 241 genes were linked to these SNPs, yielding a rate of 1.32 SNPs per gene. In contrast, merely 146 genes contributed to the 474 gene-derived SNPs, for which the assignment of the HUGO gene ID of Table 1 was a precondition for the selection. However, many more well-characterized SNPs are available for these genes (3.25 SNPs per gene). Figure 3 shows the distribution of SNPs per gene both from association to markers (3a) and to genes (3b).

Figure 3: Distribution of SNPs per gene for genes derived from associations with genetic markers and genes.
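How such counts follow from a list of SNP-to-gene assignments may be sketched as below. The input mapping and the function name `summarize` are hypothetical; the rate is computed as the number of SNPs divided by the number of genes, which matches the quoted values (319/241 and 474/146).

```python
from collections import Counter


def summarize(snp_to_genes):
    """Summarize SNP-to-gene assignments given as {snp_id: [gene_id, ...]}."""
    genes_per_snp = {snp: len(set(genes)) for snp, genes in snp_to_genes.items()}
    snps_per_gene = Counter(
        gene for genes in snp_to_genes.values() for gene in set(genes)
    )
    n_snps, n_genes = len(snp_to_genes), len(snps_per_gene)
    return {
        "snps": n_snps,
        "genes": n_genes,
        "multi_gene_snps": sum(1 for n in genes_per_snp.values() if n >= 2),
        "max_genes_per_snp": max(genes_per_snp.values(), default=0),
        "snps_per_gene_rate": n_snps / n_genes if n_genes else 0.0,
        # Number of genes having 1, 2, ... SNPs (the kind of data in Figure 3).
        "distribution": Counter(snps_per_gene.values()),
    }


# Hypothetical assignments, for illustration only.
example = {
    "snp1": ["G1"],
    "snp2": ["G1", "G2"],  # e.g. downstream of G1 and upstream of G2
    "snp3": ["G3"],
    "snp4": ["G3"],
}
summary = summarize(example)
```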
SNPs associated to genes (A) are much better characterized in their population distribution than those associated to genetic markers (B).


To support the approach of preferring SNPs with different frequencies in different populations, we took the latest publication reporting a SNP association with MS [Zhou et al., 2003]. Of the 32 dbSNP entries assigned to the gene CD24, the reported SNP (rs8734) is the only one with frequency ratios. It is also the only SNP that changes the encoded peptide (Figure 4). Indeed, the major allele of the Asian population is the minor allele of the multinational population. The ratio of frequencies rose from 1.21 for the frequency of minor alleles to 1.79 (T) and 1.89 (C) when comparing the frequencies for the nucleotides directly. No information on sex difference was available for any of the five SNPs in CD24 with TSC accession numbers, although CD24 is located on the Y chromosome.

Figure 4: Output of the SNPselect tool. The column "ratio" presents the ratio of the frequencies of the minor alleles of the multinational and the east asian population. The column "corrected ratio" takes into account that the minor allele of the Asian population is the major allele of the multinational population.

The gene peptidylarginine deiminase type IV (PADI4) is discussed in the context of both rheumatoid arthritis (RA) and multiple sclerosis [Vossenaar et al., 2003]. RA patients show autoantibodies to PAD proteins, but it was not known which of the PADI genes of types I to IV are associated with the disease. Further insights were achieved by a case-control linkage disequilibrium analysis of RA patients and controls [Suzuki et al., 2003]. The maximum linkage disequilibrium was determined for the third exon of PADI4. A search with the SNPselect tool for SNPs with frequencies determined for a North American or European and any other population for the genes PADI 1-4 yielded 6 SNPs, none of which was a SNP of PADI1. The maximal ratios were indeed achieved for the SNP rs1635565: 3.04-fold between a North American and an East Asian population (a ratio of 2 was achieved for another East Asian population). PADI2 had one SNP with a ratio of 2 (rs733785, same populations). The ratios for PADI3 were all below 1.4.


We present the concept and a supporting web site to select SNPs for disease-linkage disequilibrium analysis with a preference for those which show different frequencies in populations that also differ in the prevalence of the disease. The chromosomal neighborhood of a set of genomic markers with linkage to MS and of MS candidate genes was investigated for SNPs.

The findings versus the expected

The marker-derived loci cover about 38 Mbp (38*0.5Mbp*2) of the genome, the genes add another 24 Mbp (99*0.12Mbp*2), in sum representing 2.2% of the Ensembl genome sequence ("golden path"). The Ensembl SNP database holds 4,763,257 SNPs, of which 22,670 (0.47%) have data assigned for two subpopulations to form a ratio for this study. Of these, only 793 (3.5%; 0.017% of the total) were found to be located in MS susceptibility loci. Hence, only every 132nd SNP within these loci is investigated for frequency ratios, due to the requirement for each SNP to have frequency data assigned for at least two populations. With an average of 202 SNPs per gene, there are over 1.5 such SNPs per gene with frequency data for two populations. However, a strong bias of the selection towards those SNPs associated with well-studied genes is expected [Brumfield et al., 2003] and is indirectly supported by the different distributions for loci derived from genomic marker association studies and those associated with genes, as described in the results section. Nevertheless, with an increasing number of characterized SNPs, the method will become steadily more powerful.

The density of genes with selected SNPs for markers (~ 9 genes per Mbp) is slightly higher than the average gene density (8.28 genes/Mbp), whereas for gene-associated SNPs the gene density is lower, with 6 (146/24) genes per Mbp. This may indicate a strong bias for particular genes, leaving an average of 2 genes per Mbp (25% of the average gene density) uncharacterized.
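The estimate can be retraced step by step. The sketch below assumes that SNPs are spread evenly over the genome and that the rounded coverage of 2.2% is used for the "every 132nd SNP" figure; both are interpretations made for this illustration.

```python
GENOME_MBP = 2841
ENSEMBL_GENES = 23531
TOTAL_SNPS = 4_763_257
SNPS_WITH_TWO_POPULATIONS = 22_670
SNPS_IN_MS_LOCI = 793

marker_mbp = 38 * 0.5 * 2   # 38 markers, 0.5 Mbp to either side
gene_mbp = 99 * 0.12 * 2    # factors as given in the text, about 24 Mbp

# Share of the golden path covered by the loci, in percent (about 2.2).
coverage_pct = round(100 * (marker_mbp + gene_mbp) / GENOME_MBP, 1)

# Shares of SNPs with data for two populations.
frac_two_populations = SNPS_WITH_TWO_POPULATIONS / TOTAL_SNPS
frac_in_loci_of_usable = SNPS_IN_MS_LOCI / SNPS_WITH_TWO_POPULATIONS
frac_in_loci_of_total = SNPS_IN_MS_LOCI / TOTAL_SNPS

# Assumption: SNPs are evenly distributed, so the loci hold 2.2% of all SNPs.
expected_snps_in_loci = TOTAL_SNPS * coverage_pct / 100
one_in = expected_snps_in_loci / SNPS_IN_MS_LOCI  # about 132

snps_per_gene = TOTAL_SNPS / ENSEMBL_GENES        # about 202
usable_snps_per_gene = snps_per_gene / one_in     # a little above 1.5
```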

Reliability and completeness of public SNP data

Various concerns raised about the reliability of the SNP information in public databases still hold. A 50%-60% success rate of an entry in dbSNP to indicate a true polymorphism was reported [Marth et al., 2001]. However, those SNPs found to be differentially distributed in human subpopulations are particularly well verified.

Jiang et al., 2003, also address the completeness of public SNP data. It was found that for frequent alleles (>20% frequency of the minor allele) more than half are found in the databases, while not necessarily characterized for multiple populations. For rare alleles, fewer than 20% could be found.

The limited number of individuals available for the test of disease association suggests choosing SNPs that appear frequently in at least one subpopulation.

Information missing from dbSNP

The approach does not address epistatic effects. Such disease-associated coupling between SNPs can only be investigated after a SNP analysis of the patient's phenotype. The HGVbase is developing in this direction and Ensembl has recently included links to the HapMap project. Links to the structure of SNPs [Stitziel et al., 2004] are only of secondary interest since a coupling of neighboring SNPs is expected. The SNP investigated may not have functional effects, but its particular distribution between patients and controls may point to a locus of particular importance.

For autoimmune diseases like MS with a sex-difference in the susceptibility, the information on the frequency of a particular SNP for the two sexes would be of strong interest. This information was found in the database of The SNP Consortium and a local web service (tscselect.php) was created for manual inspection.
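A sex-specific summary of genotype data can be sketched as follows. The genotype counts are hypothetical and the function is not the code of tscselect.php; it only illustrates the kind of calculation involved.

```python
def genotype_frequencies(counts):
    """Convert {sex: {genotype: count}} into {sex: {genotype: fraction}}."""
    result = {}
    for sex, by_genotype in counts.items():
        total = sum(by_genotype.values())
        # Leave the entry empty if no individual of that sex was genotyped.
        result[sex] = (
            {genotype: n / total for genotype, n in by_genotype.items()}
            if total else {}
        )
    return result


# Hypothetical genotype counts for one SNP, for illustration only.
counts = {
    "female": {"GG": 10, "GA": 25, "AA": 15},
    "male": {"GG": 20, "GA": 20, "AA": 10},
}
freqs = genotype_frequencies(counts)
```

With these counts, 20% of the females and 40% of the males are homozygous for G.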

A useful web portal to well-described SNPs is SNP3D by Peng You et al. It combines structural analysis with disease association and pathway data. The tool SNPselect presented here complements that functionality for the selection of SNPs.


A subset of the SNPs selected by this approach is now being analyzed in a cohort of Caucasian MS patients from northern Germany. The loci will be further constrained by gene expression data from MS patients and the murine animal model EAE [Ibrahim et al., 2001]. Furthermore, we require the SNPs to lie within regions syntenic to mouse and rat that are themselves discussed as susceptibility loci for the animal model of MS [Serrano-Fernández et al., submitted]. With limited resources, both in patient samples and material, the process for SNP selection presented here helps to focus the analysis on the most promising candidates. Prospectively, it also allows the testing of multiple SNPs for a single gene in order to gain insights into the previously mentioned coupling of SNPs in individuals.

Figure 5: Sex difference of data from The SNP Consortium.
A graphical presentation and web access were created to calculate the frequencies of SNPs by both race and sex. This information is not available from dbSNP or Ensembl. The figure represents an example for the SNP rs476646 (TSC0214958). Here, females are less frequently homozygous for G.


The authors thank Michael Kreutzer for his technical assistance. Peter Lorenz, Patrik Wernhoff and Vasilis Kotsikoris are thanked for comments and the critical reading of the manuscript. This work was funded by the BMBF project NBL3 (FKZ 01ZZ0108).