Polygenic Risk Score (PRS) Tutorial
OVERVIEW
The purpose of this tutorial is to furnish a step-by-step guide on the execution of a Polygenic Risk Score (PRS) analysis, accompanied by an introduction to fundamental genetic principles germane to this methodology.
The tutorial is divided into two sections, First, a series of basic concepts about genetics and PRS are introduced at a theoretical level. Secondly, the tutorial itself consists of 6 steps which are detailed below.
Step-by-step PRS analysis:
- STEP 0: Data selection + QC check
- STEP 1: SNPs Extraction
- STEP 2: SNPs Alignament check
- STEP 3: Quality control (QC)
- STEP 4: PRS calculation
- STEP 5: PRS resultsn
Software requirements
- Linux terminal access with the following tools
installed:
- PLINK 2 - https://www.cog-genomics.org/plink/2.0/
- bcftools - http://www.htslib.org/download/
- R
Required Data
To carry out this PRS analysis it is necessary two types of data. Here is the link to download the data used in the tutorial. If you want to use your own data, see what is necessary in the section STEP 0 - Data selection + QC check.
- SNPs DATA. List of SNPs associated with a characteristic of interest. This tutorial will use a list of 205 SNPs associated with colon cancer (CRC) from the article: Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries
You can download it directly here: Download - 1_CRC_SNPs_list_205_Nature_2023
Attention: This file has an error in column “POS (GRCh37)” for some
SNPs, the position was wrong, when compared to column “VARIANT”. The
file in data/resources has been fixed.
The same file can be
downlaoded from PSG catalog (https://www.pgscatalog.org/score/PGS003850/). However,
this file has the same errors and doesn’t have the column VARIANT to
help fixing it.
- TARGET DATA. Imputed genotyping data from a population of interest. Simulated data based on GenRisk data will be used here. For reasons of space, only the data of the patients with the SNPs associated with CRC can be downloaded (document obtained in STEP 1) Download - GenRisk_all_CRC_SNPs_perm.vcf .
To be used from the section Step 1 - SNP extraction. 2 VCF files concatenation
- METADATA. Phenotypic data of the target data. In this case which of the samples is controlled and which CRC. Metadata_GenRIsk -
Attention: Whether you wish to utilize your own data or existing datasets, it’s important to note that the data used should have already undergone a minimum quality control (QC) process, with hg37 as the reference genome. For the target data, it has been imputed using the 1000 Genomes reference.
For more information regarding the format of the data to be used or how to adapt see here
Info
This tutorial requires R and a linux terminal. You can use Rstudio or Visual Studio Code (with R plugins), that include a terminal.
Both in R and in terminal, first check that you are in the correct directory. In R remember to load above mentioned libraries.
You can clone our github repository to reproduce this practical. The exception is STEP 1, that requires real imputed genotype data. We canot share real data, but we provide fake data resulting from the pipeline in directort data/resources)
What’s a Polygenic Risk Score
A Polygenic Risk Score (PRS) is a single numerical metric that
estimates an individual’s genetic risk for a particular disease or trait
(phenotype). This score is calculated by summing the contributions of
risk alleles carried by the individual, each weighted by its respective
effect size as determined by a Genome-Wide Association Study (GWAS)
specific to the phenotype in question.
In essence, the PRS encapsulates the combined impact of various genetic variants, predominantly Single Nucleotide Polymorphisms (SNPs), within an individual’s genome. A higher PRS indicates a greater genetic predisposition, while a lower score suggests a reduced risk, thereby facilitating the prediction of disease susceptibility or trait expression based on one’s genetic makeup. This approach offers valuable insights into the genetic architecture of complex traits and diseases.
Effect size - βi
In PRS, there exist two primary approaches:
Unweighted PRS, where each genetic variant is assigned a fixed weight of 1 (βi = 1).
Weighted PRS, which employs the actual effect sizes, typically represented as betas (βi) or Odds Ratios (OR), derived from GWAS specific to the phenotype in question.
The choice between these two approaches is contingent upon several key factors.
¿When to use an unweighted PRS (βi = 1)?
- Your data have been used for the GWAS -> model overfitting
- Poorly robust GWAS
PRS applications
- Disease risk prediction
- Screening and prevention.
- Personalized treatment
- Epidemiological Studies
- Drug Development
Step-by-step PRS analysis
The step-by-step analysis structure that will be followed to perform the PRS analysis is as follows:
STEP 0: Data selection + QC check. The process begins with the careful selection of genetic data (SNPs data and target data) relevant to the study, ensuring that it accomplishes general quality control standards commonly used in GWAS analyses.
STEP 1: SNPs Extraction. Consists of the extraction of the SNPs associated with the study characteristic or disease from the total pool of SNPs in the target dataset where the PRS will be calculated.
STEP 2: SNPs Alignament check. Discrepancies, such as missing SNPs, duplicates, variations in strand orientation, and more, can occur between the SNPs in the SNP dataset and the target data. These disparities may arise due to differences in platforms, imputation methods, or genome builds. Therefore, it is essential to carefully evaluate and resolve such differences whenever possible.
STEP 3: Quality control (QC). Imputed genotyping data, QC is performed both before and after imputation, ensuring the use of high-quality data and the exclusion of poorly imputed variants.
STEP 4: PRS calculation. Calculate the PRS by combining the genetic variants (SNPs) and their associated effect sizes in the target data.
STEP 5: PRS results. This involves drawing meaningful insights from the calculated scores and making conclusions about genetic predisposition for the studied phenotype.
Other Basic Genetic Concepts
STEP 0 - Data selection + QC check
Below is detailed where to find, in which formats, what minimum information is required as well as what minimum QC the data must have passed.
SNPs Data Selection
The initial phase in the analysis of the Polygenic Risk Score (PRS) involves acquiring the base data, which comprises a list of SNPs associated with the specific trait or condition of interest.
These SNP datasets can be sourced from publicly available GWAS studies found in the literature or various genetic databases. Alternatively, researchers can conduct their own GWAS to generate the required summary statistics.
For the purpose of this tutorial, we will utilize SNP data associated with Colorectal Cancer (CRC) as our reference dataset.
SNPs data are a set of individual SNPs that have been identified as related to the characteristic or trait studied through a GWAS study. These data usually include the position of the SNP in the genome, the estimated genetic effect (such as odds ratio or beta), the allelic frequency, and other related statistics.
- Where to find:
- Polygenic Score (PGS) Catalog. https://www.pgscatalog.org/
- GWAS Catalog. https://www.ebi.ac.uk/gwas/
- Bibliography
- Calculate your own GWAS
- Minimum information required:
- Chromosome nº
- Position
- Effect and non-effect allele
- Effect size (betas or OR)
- QC check:
It is necessary to ensure that the list of SNPs selected has passed a minimum QC check. -> those applied to a GWAS
- Genome build -> Check that the reference genome is the same between the SNP data and the target data. Otherwise, it is necessary to carry out a Liftover. Better perform on the data of the SNPs of interest.
- Effect allele
- Standard GWAS QC:
- Genotyping rate >0,99
- Remove SNPs with >5% missing data
- Heterozygosity. Remove samples > mean(proportion of heterozygous sites ) ± 3 * sd(proportion of heterozygous sites)
- Exclude individuals with sex discordance
- Imputation information score (INFO score). > 0.3 – 0.8
- Minor Allele Frequency (MAF). > 0.005 – 0,001
- Linkage-disequilibrium (LD) score
- Hardy-Weinberg equilibrium (HWE) P>1x10-6
Target Data
TARGET DATA means the genotyping data of a population already imputed. (Usually own data - collaborations)
This data can be in multiple formats some of the most common are detailed below. (Depending on where they are downloaded or with which the charge is made).
- Imputed genotype data formats:
- Variant Call Format (VCF). e.g chr.{i}vcf.gz
- PLINK 1.9 format [.bed] [.bim] [.fam]
- BGEN format [.bgen], [.sample], [.bgen.bgi]
- PLINK 2 format [.pgen], [.pvar], [.psam]
NOTE: Generally, imputation with 1000 genomes or TopMed results in vcf files separated by chromosomes. This tutorial starts with data in this format.
- QC check:
- More than 100 individuals
- Genome build
- No sample overlap with SNPs GWAS samples
- Standard GWAS For samples:
- Individuals genotype rate >0.99
- Remove sample with >10% missing data
- Heterozygosity. Remove samples > mean(proportion of heterozygous sites ) ± 3 * sd(proportion of heterozygous sites)
- Exclude individuals with sex discordance
- Remove duplicates and relatedness samples. PI_HAT > 0.8 For SNPs:
- Imputation information score (INFO score). > 0.3 – 0.8
- Minor Allele Frequency (MAF). > 0.01 - 0.005
- Linkage-disequilibrium (LD) score
- Hardy-Weinberg Equilibrium (HWE) p > 1x10-6
ATTENTION! The quality control for this data is conducted both before and after imputation. Post-imputation quality control (QC) related to SNPs can be carried out either before initiating the Polygenic Risk Score (PRS) analysis or during the analysis. In this instance, SNP QC is performed after extracting the SNPs of interest from the Target Data. This approach is chosen because the dataset is smaller, making the process faster.
STEP 1 - SNPs Extraction
This section consists of two parts:
SNPs list preparation
Preparation of a list of SNPs associated with CRC for extraction in the target data. A .txt file with two columns:
- Chromosome number
- SNP position
ATTENTION! If the reference genome does not match the list of SNPs associated with the study trait and the target data to be used, a Liftover must be performed, see here.
In this case both the list of SNPs and the target date are in GRCh37.
- Load SNPs associated to CCR:
SNPs Data from Ceres Fernandez-Rozadilla et al. Nature 2023, saves as: .1_CRC_SNPs_list_205_Nature_2023.xlsx in resources folder.
R
CHR | POS (GRCh37) | VARIANT | rsID | EFFECT_ALLELE | OTHER_ALLELE | BETA |
---|---|---|---|---|---|---|
1 | 22503282 | 1:22503282_G/C | rs2807367 | C | G | -0.03790 |
1 | 22710877 | 1:22710877_G/C | rs34963268 | C | G | -0.06589 |
1 | 38461319 | 1:38461319_C/A | rs61776719 | A | C | -0.05610 |
1 | 55247852 | 1:55247852_A/G | rs12143541 | G | A | 0.08250 |
1 | 62673037 | 1:62673037_T/C | rs7542665 | C | T | 0.01850 |
- Cration of chr position .txt file
The CHR and POS columns (GRCh37) are saved in a .txt file.
CHR | POS (GRCh37) |
---|---|
1 | 22503282 |
1 | 22710877 |
1 | 38461319 |
1 | 55247852 |
1 | 62673037 |
1 | 71040166 |
- Save .txt file
R
# two avoid scientific annotation
SNP_file_CRC$`POS (GRCh37)` <- format(SNP_file_CRC$`POS (GRCh37)`, scientific=FALSE)
# create folder data/1_snps_extraction/ and save as .txt
dir.create("data/1_snps_extraction/", recursive=TRUE)
write.table(SNP_file_CRC, "data/1_snps_extraction/1_SNP_file_CRC.txt", quote = F, col.names = F, row.names = F, qmethod = "double", sep = "\t")
SNPs extraction
The aim of this step is to extract data on SNPs associated with colorectal cancer from GenRisk data.
It consists of 3 substeps:
- GenRisk imputed data
In this case, the GenRisk data already imputed by 1000 genomes will be used. Data imputed with TopMed or 1000Genomes servers return data in vcf.gz format:
ATTENTION! The data is substantial and cannot be shared. As a solution, the code for extracting CRC SNPs using the Linux bcftools tool, chromosome by chromosome, will be demonstrated without execution. Next, the process of merging individual files from each chromosome will be presented. This final outcome is what will be made available to proceed with the exercise.
In other words, the data to which you have access are the results of the VCF files concatenation section and are called: GenRisk_all_CRC_SNPs.vcf.gz
On the other hand, the shared data have been modified to ensure compliance with personal data protection regulations.
- 1 SNPs extraction:
LINUX Terminal
# Extraction of the SNPs in a loop for the different chromosomes.
## bcftool view <directory/_chr*_vcf.gz> -R <directory/SNP_list.txt> -o <directory/output_file.vcf.gz>
### view -> allows filtering and viewing data contained in VCF files.
### -R -> Region. specific region of the genome
### -o -> Output.
for i in {1..22}
do
bcftools view data_dir/GenRisk_chr${i}.vcf.gz -R data/1_snps_extraction/1_SNP_file_CRC.txt -o data/1_snps_extraction/GenRisk_chr${i}_CRC_SNPs.vcf.gz
done
#time 1 min 30 s
The application of this code generates new vcf files for each chromosome, only with the SNPs present in the file 1_SNP_file_CRC.txt. The following vcf files are generated:
- 2 VCF files concatenation:
Merge of GenRisk_chr${i}_CRC_SNPs.vcf.gz into a single file:
R
This is the list of files to be concatenated that is created:
LINUX Terminal
GenRisk_all_CRC_SNPs_perm.vcf.gz is the data that can be downloaded (folder data/resources) and worked from here.
NOTE: The GenRisk_chr${i}_CRC_SNPs.vcf.gz files created during SNP extraction are intermediate, and it is recommended to delete them later to avoid unnecessary space consumption
LINUX Terminal
- 3 Conversion to PLINK2 format
Plink2 is a widely utilized tool for genetic data analysis. Converting VCF files to Plink2 enables the utilization of various features, including the creation of genetic relationship matrices and the calculation of polygenic risk scores (PRS). This conversion not only streamlines data compatibility with other tools but also enhances execution speed. Plink2, being equipped with PRS calculation functionality, further facilitates genetic analysis processes.
PLINK2 format
PLINK2 conversion
Below conversion to PLINK 2 format, relabel variant IDs to ensure they have a unique identifier, and print a frequency report necessary to identify ambidextrous SNPs (SNPs QC, next steps).
Resulting files: - PLINK 2 format (.pgen, .pvar, .psam). - Obtain allele frequency report (.afreq).
LINUX Terminal
Code notes:
- –vcf.gz -> input file
- –new-id-max-allele-len 50 -> set the maximum length for new allele identifiers (IDs) in the output format. By default it is 14 and it may not be enough
- –set-all-var-ids @:#$r$a -> set SNPs id -> chr:pos_ref_alt
- –freq -> compute allele frequency and output to .afreq file
- –make-pgen -> generate files in plink 2 format
Results: When applying this caddy, 4 files are generated:
- raw_data.freq
- raw_data.log
- raw_data.pgen
- raw_data.psam
- raw_data.pvar
QUESTION 1 How many variables have been detected? How many samples?
205
variables and 4295
samples
STEP 2 - SNPs Alignament check
Inconsistencies, including absent SNPs, duplications, variations in strand orientation, and other discrepancies, may exist between the SNP dataset and the target data. These variations could stem from divergences in platforms, imputation methods, or genome builds. Hence, it is crucial to thoroughly assess and address these distinctions whenever feasible.
These discrepancies can be:
Load data
R
R
CHR | POS | ID | REF | ALT | FILTER | INFO |
---|---|---|---|---|---|---|
1 | 22503282 | 1:22503282_G_C | G | C | PASS | AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=5515;AN=8590 |
1 | 22710877 | 1:22710877_G_C | G | C | PASS | AF=0.18103;MAF=0.18103;R2=0.96771;IMPUTED;AC=1590;AN=8590 |
1 | 38461319 | 1:38461319_C_A | C | A | PASS | AF=0.49037;MAF=0.49037;R2=0.79922;IMPUTED;AC=4458;AN=8590 |
1 | 55247852 | 1:55247852_A_G | A | G | PASS | AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1028;AN=8590 |
1 | 62673037 | 1:62673037_T_C | T | C | PASS | AF=0.66593;MAF=0.33407;R2=0.99843;ER2=0.96251;TYPED;AC=5727;AN=8590 |
1 | 71040166 | 1:71040166_G_T | G | T | PASS | AF=0.38298;MAF=0.38298;R2=0.80064;IMPUTED;ER2=0.84;TYPED;AC=3339;AN=8590 |
Missing SNPs
The selected SNPs may not all be available in the target dataset (Missing SNPs), which could be due to:
- Different Genotyping / imputation platform
- SNP have not passed QC (low quality)
Although it looks like all SNPs have been found (205) Let’s see if any are missing.
Find SNPs by chr:position
R
QUESTION How many variables were found to be missing in our data? Which one?
CHR | POS (GRCh37) | VARIANT | rsID | EFFECT_ALLELE | OTHER_ALLELE | BETA | SE | P | REF | PMID | PMCID | DOI | PUBLISHED_SNP | LD across all included populations | id_pos |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X | 9763898 | 23:9763898_C/G | rs2732875 | C | G | 0.169 | 0.024 * | 8.3E-12 * | Dunlop MG, Nat Genet, 2012 | 22634755 | PMC4747430 | 10.1038/ng.2293 | rs5934683 | R2:0.28; D’:0.94 | X:9763898 |
1
variable have been identified as missing. The
missing SNP is the one on the X chromosome, as sex chromosome data is
not being used.
QUESTION What is the problem with the
205 SNPs extracted if we already know that one is missing?
The problem is that if the missing X chromosome SNP is subtracted from the initial 205 SNPs, theoretically, we would expect to extract 204 SNPs, indicating an extra SNP. There is a duplicated/multi-allelic/bialellic SNP?
Multi-allelic / Duplicated / Biallelic SNPs
Let’s check if there are any duplicate chr:position IDs:
Yes, there is a duplicated chromosome:position ID. Which one?
CHR | POS | ID | REF | ALT | FILTER | INFO | id_pos |
---|---|---|---|---|---|---|---|
12 | 64404555 | 12:64404555_T_A | T | A | PASS | AF=0.0005;MAF=0.0005;R2=0.56084;IMPUTED;AC=3;AN=8590 | 12:64404555 |
12 | 64404555 | 12:64404555_T_C | T | C | PASS | AF=0.46244;MAF=0.46244;R2=0.80188;IMPUTED;AC=3995;AN=8590 | 12:64404555 |
To know which one is the incorrect, it must be compared with the
original listing.
R
CHR | POS | ID | REF | ALT | FILTER | INFO | id_pos |
---|---|---|---|---|---|---|---|
12 | 64404555 | 12:64404555_T_A | T | A | PASS | AF=0.0005;MAF=0.0005;R2=0.56084;IMPUTED;AC=3;AN=8590 | 12:64404555 |
R
This incorrect SNP should be removed from the data. This process is carried out together with the quality control in STEP 3 - QC.
Mismatching SNPs
It is essential to verify whether the REF and ALT alleles in the target data match those in the SNP list. For example, in the target data, there may be SNPs with A/C alleles in the base data and G/T alleles in the target data. This discrepancy can occur when one dataset was genotyped with reference to the forward strand, and the other was genotyped with reference to the reverse strand.
SNPs that have mismatched alleles reported in the original SNP list and the target data can be resolved by strand-flipping the alleles to their complementary alleles. Most polygenic scoring software programs perform strand shifting automatically for resolvable SNPs and remove mismatched non-resolvable SNPs.
SNPs with mismatched alleles between the SNP list and the target data can be resolved by “strand-flipping” the alleles to their complementary pairs. Most polygenic scoring software programs automatically perform strand flipping for resolvable SNPs and exclude non-resolvable SNPs with mismatches.
PLINK2 does not do this automatically, if there is a mismatching SNP in the data, the “strand-flipping” must be done manually.
All SNPs are then verified to have the correct alleles, using ID = chr:position_other_allele_effect_allele.
CHR | POS (GRCh37) | VARIANT | rsID | EFFECT_ALLELE | OTHER_ALLELE | BETA | SE | P | REF | PMID | PMCID | DOI | PUBLISHED_SNP | LD across all included populations | id_pos | ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X | 9763898 | 23:9763898_C/G | rs2732875 | C | G | 0.169 | 0.024 * | 8.3E-12 * | Dunlop MG, Nat Genet, 2012 | 22634755 | PMC4747430 | 10.1038/ng.2293 | rs5934683 | R2:0.28; D’:0.94 | X:9763898 | 23:9763898_C_G |
It has been confirmed that all SNPs have the correct alleles, except for the SNP on the X chromosome
Ambiguous SNPs
Ambiguous (Palindromic) SNPs are those in which the allelic variants are complementary. In the context of PRS calculation, they are SNPs where the two allelic variants are symmetrical (e.g., A/T in the SNP list and T/A in the target data) and have similar frequencies, making it difficult to distinguish which variant is the effective one. If you have access to the effect allele frequency (EAF) data from the list of SNPs of interest, you can compare it with the EAF data in your dataset to determine which allele is the reference (REF). However, EAF data is rarely available in downloadable SNP lists.
In cases where EAF data is not available or when EAF is close to 0.5, these ambiguous SNPs should be excluded. In PRS calculations, SNPs with allelic frequencies close to 0.5 (between 0.4 and 0.6) are considered ambiguous and are typically excluded.
To check if there are ambiguous SNPs in the data, first load the EAF data from the target data obtained during the extraction of the SNPs and apply an EAF filter >0.4 | <0.6:
R
CHROM | ID | REF | ALT | ALT_FREQS | OBS_CT |
---|---|---|---|---|---|
1 | 1:22503282_G_C | G | C | 0.641909 | 8590 |
1 | 1:22710877_G_C | G | C | 0.185099 | 8590 |
1 | 1:38461319_C_A | C | A | 0.518859 | 8590 |
1 | 1:55247852_A_G | A | G | 0.119674 | 8590 |
1 | 1:62673037_T_C | T | C | 0.666705 | 8590 |
1 | 1:71040166_G_T | G | T | 0.388591 | 8590 |
R
Those SNPs are saved to a .txt file to be removed later.
R
STEP 3 - Quality control (QC)
It is important to ensure that the SNPs chosen to calculate the PRS present a minimum of quality in the Target data. For this reason, the following points will be reviewed:
Any SNP that does not pass this QC will be removed in the final step of this section, Apply QC exclusions.
a) Imputation information score (INFO)
Imputation is a process used to predict or estimate genotypic information for unobserved or missing genetic variants (SNPs) in a dataset based on available genotype data and reference panels.
The imputation information (INFO score – r2) statistic is a measure of imputation quality which typically takes values between 0 and 1, where 0 indicates complete uncertainty and 1 represents complete certainty about the imputed genotype.
Variants with a low imputation information score (r^2 < 0.3 - 0.8) are often regarded as less reliable and are typically excluded from further analyses.
In this case a threshold of 0.4 is used.
R
CHR | POS | ID | REF | ALT | FILTER | INFO | id_pos |
---|---|---|---|---|---|---|---|
1 | 22503282 | 1:22503282_G_C | G | C | PASS | AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=5515;AN=8590 | 1:22503282 |
1 | 22710877 | 1:22710877_G_C | G | C | PASS | AF=0.18103;MAF=0.18103;R2=0.96771;IMPUTED;AC=1590;AN=8590 | 1:22710877 |
1 | 38461319 | 1:38461319_C_A | C | A | PASS | AF=0.49037;MAF=0.49037;R2=0.79922;IMPUTED;AC=4458;AN=8590 | 1:38461319 |
1 | 55247852 | 1:55247852_A_G | A | G | PASS | AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1028;AN=8590 | 1:55247852 |
1 | 62673037 | 1:62673037_T_C | T | C | PASS | AF=0.66593;MAF=0.33407;R2=0.99843;ER2=0.96251;TYPED;AC=5727;AN=8590 | 1:62673037 |
1 | 71040166 | 1:71040166_G_T | G | T | PASS | AF=0.38298;MAF=0.38298;R2=0.80064;IMPUTED;ER2=0.84;TYPED;AC=3339;AN=8590 | 1:71040166 |
In pvar the data about INFO (r^2) is in the column called INFO where there is a lot of other data. R2 is extracted and stored in a new column called R2.
eg. INFO estructure AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=9067;AN=14206 or AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1687;AN=14206
The information to be extracted can be seen in bold.
R
CHR | POS | ID | REF | ALT | FILTER | INFO | id_pos | INFO2 | R2 |
---|---|---|---|---|---|---|---|---|---|
1 | 245181421 | 1:245181421_T_C | T | C | PASS | AF=0.00087;MAF=0.00087;R2=0.06303;IMPUTED;AC=0;AN=8590 | 1:245181421 | AF=0.00087;MAF=0.00087;R2=0.06303;IMPUTED;AC=0;AN=8590 | 0.06303 |
12 | 31594813 | 12:31594813_C_T | C | T | PASS | AF=0.00089;MAF=0.00089;R2=0.03916;IMPUTED;AC=0;AN=8590 | 12:31594813 | AF=0.00089;MAF=0.00089;R2=0.03916;IMPUTED;AC=0;AN=8590 | 0.03916 |
These SNPs will be added to a .txt file called to_remove later,
to be filtered in the last step of this QC.
b) Minor allele frequency (MAF)
Minor Allele Frequency (MAF) indicates the rarity of a genetic variant within a population. Variants with a MAF greater than 5% are considered common, those with a MAF between 1% and 5% are categorized as low frequency, and variants with a MAF less than 1% are considered rare.
In PRS analyses, it’s common practice to filter out SNPs with low frequency because they may not have enough statistical power to provide meaningful insights. In this case, we will apply a threshold of 0.005 directly to the Plink2 function using the –maf flag. This threshold helps ensure that only SNPs with a MAF greater than or equal to 0.005 are included in the analysis, focusing on more common variants for robust and reliable results.
c) Linkage Disequilibrium (LD)
Linkage disequilibrium (LD) measures the connection between closely located genetic variants that often share inheritance patterns because of their proximity, leading to a genetic association within a population.
During the construction of a PRS, variants in high LD will have been identified and removed by methods such as clumping / pruning. This is done to reduce redundancy and bias in the PRS analysis, as it is desired that the markers used be as informative as possible without duplicating information.
In order to verify this, or apply more stringent thresholds, the
PLINK 2 command –indep-pairwise can be used to prune
the data, resulting in a subset of variants with LD below a specified
threshold.
LINUX Terminal
# Pruning / clumping to remove highly correlated SNPs:
# make folder to save QC results
mkdir -p data/3_QC
# remove redundant SNPs
plink2 \
--pfile data/1_snps_extraction/raw_data \
--indep-pairwise 200 1 0.3 \
--out data/3_QC/LD
## "200" refers to the distance in kilobases (kb) between the SNPs
## "1" refers to the maximum number of SNPs that can be included in a clumping group.
## "0.3" is the correlation threshold (R^2) used to determine if two SNPs are "clumpable"
After applying this formula, two files are generated:
- LD.prune.out -> Contains SNPs to be removed due to linkage disequilibrium.
- LD.prune.in -> Contains SNPs that do not exhibit linkage disequilibrium.
d) Hardy-Weinberg Equilibrium (HWE)
Hardy-Weinberg Equilibrium (HWE) is a principle stating that allele and genotype frequencies should remain constant in a stable, unchanging population. Deviations from HWE may suggest genotyping errors, leading to the exclusion of such variants in analyses.
The PLINK 2 command –hwe filters out variants deviating from HWE using a specified p-value threshold.
HWE was previously tested in the GenRisk data; however, it is now confirmed using a threshold of 1e-6:
LINUX Terminal
No variant should be deleted fro HWE.
Apply QC exclusions
All SNPs that have not passed QC, as well as ambiguous ones, are saved to a file called to_remove.txt and removed from the Target Data using PLINK2.
to_remove.txt creation
In the PRS calculation function, Plink only accepts a list of SNPs to be excluded. Therefore, we need to combine the list of:
- SNP extracted and not included in the list of interests.
1
SNP, save as multi_to_remove.txt - Ambiguous SNPs.
3
SNPs, save as exclrsIDs_ambiguous.txt - SNPs with low INFO.
2
SNPs, save as r2_04_remove (r object). - LD SNPs.
2
SNPs, save as ld_snps (r object)
This combined list is saved in a file named “to_remove.txt.”
R
# The incorrect SNP extracted load.
multi_remove <-read.table("data/2_SNPs_alignament_check/multi_to_remove.txt", colClasses = c("character"))
colnames(multi_remove) <- c("CHROM", "ID", "REF", "ALT")
multi_remove
## CHROM ID REF ALT
## 1 12 12:64404555_T_A T A
R
R
##
## FALSE
## 3
##
## FALSE
## 3
##
## FALSE
## 1
##
## FALSE
## 2
##
## TRUE
## 2
##
## FALSE
## 2
QUESTION Finally, how many SNPs have failed QC and will be removed?
There are
2
SNPs that present both LD and a low
imputation quality score. So in total we have 6 SNPs that have not
passed the QC and will be removed.
R
# Union of the tables
to_remove <- unique(c(ambiguous_snps_ID,multi_remove_ID,r2_04_remove_ID,ld_snps ))
to_remove
## [1] "13:34093518_C_G" "15:33009574_C_G" "16:86338288_A_T" "12:64404555_T_A"
## [5] "1:245181421_T_C" "12:31594813_C_T"
R
Filtering of SNPs with PLINK2
The next step involves data filtering to exclude SNPs in the to_remove.txt. On the other hand, it is still pending to apply the filter based on Minor Allele Frequency (MAF) using the –maf flag.LINUX Terminal
The outputs of the command above are:
- dataQC.pgen, dataQC.psam, dataQC.pvar: Data after SNP QC (excpet HWE) in PLINK 2 binary format
- dataQC.snplist: List of IDs of SNPs that passed QC
with the specified thresholds (from
--write-snplist
)
QUESTION What is the total number of SNPs remaining?
199
SNPs
From the initial 205, the following have been removed:
-1 for being on the X sex chromosome
-3 for ambiguous
-2
for low imputation score
-0 for low MAF
-2 (idem INFO) LD
STEP 4 - PRS calculation
- 1 Prepare score.txt
- 2 PRS calculation
- 3 PRS score rescaling
1 Prepare score.txt
Creating a file with 3 columns:
- ID of the SNPs that have passed the QC
- The RISK_ALLELE
- Betas
Thus ensuring that the risk allele according to the beta value.
This process is carried out to ensure that the PRS are consistent and accurately reflect genetic risk. In some genetic analyses, beta values associated with a SNP can be positive or negative depending on their association with a particular disease or trait. By changing the sign of negative betas to make them positive, it ensures that all genetic contributions are positive. This makes it easier to interpret them in terms of cumulative genetic risk.
Load data
It is necessary both the data of the SNPs of the target data and the beta values of the original list of the SNPs of interest. A series of modifications are made to match column names and data structure.
The ID in the target data has this structure: chr:position_ref_alt - e.x: 1:110365045_G_A.
In the case of the original SNPs list with beta data (representing the weight of each SNP in the trait under study), the structure is as follows: chr:position_other_risk, where the risk allele is determined by the sign of the beta value. In other words:
- For SNP 1:22503282_G_C with a positive beta value, the risk allele is C.
- For SNP 1:22503282_G_C with a negative beta value, the risk allele is G
R
# Load Info SNPs resultants del QC
snps_qc<-read.table("data/3_QC/dataQC.pvar")
colnames(snps_qc) <- c("CHROM", "POS", "ID", "REF", "ALT", "FILTER", "INFO")
#Transform ID to: chr:position_all_1:all_2 format
#snps_qc$ID <- gsub("_",":",snps_qc$ID)
#Keep ID and ALT / RISK_ALLELE
names(snps_qc)[names(snps_qc) == "ALT"] <- "RISK_ALLELE"
names(snps_qc)[names(snps_qc) == "REF"] <- "OTHER_ALLELE"
snps_qc <- snps_qc[,c("ID", "RISK_ALLELE","OTHER_ALLELE" )]
ID | RISK_ALLELE | OTHER_ALLELE |
---|---|---|
1:22503282_G_C | C | G |
1:22710877_G_C | C | G |
1:38461319_C_A | A | C |
1:55247852_A_G | G | A |
1:62673037_T_C | C | T |
1:71040166_G_T | T | G |
From the data of the initial SNPs list we only save the ID and the beta value:
R
# Load beta values.
betas <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")
#Transform ID to: chr:position_all_1:all_2 format -> 1:110365045_A_G
names(betas)[names(betas) == "VARIANT"] <- "ID"
names(betas)[names(betas) == "BETA"] <- "RISK_SCORE"
#betas$ID <- gsub("_",":",betas$ID)
betas$ID <- gsub("/","_",betas$ID)
betas <- betas[,c(3,7)]
ID | RISK_SCORE |
---|---|
1:22503282_G_C | -0.03790 |
1:22710877_G_C | -0.06589 |
1:38461319_C_A | -0.05610 |
1:55247852_A_G | 0.08250 |
1:62673037_T_C | 0.01850 |
1:71040166_G_T | -0.04060 |
Merge information
It is checked that beta values are available for all SNPs and the two tables are joined according to the IDs.
R
##
## TRUE
## 199
ID | RISK_ALLELE | OTHER_ALLELE | RISK_SCORE |
---|---|---|---|
1:110365045_G_A | A | G | 0.04020 |
1:172864224_A_G | G | A | -0.04088 |
1:183056222_T_C | C | T | -0.07060 |
1:201885446_G_T | T | G | -0.03869 |
1:205163798_G_A | A | G | -0.07142 |
1:222045446_G_T | T | G | 0.05670 |
setting all beta values to positive
Modification of the score_modif.txt list with the aim of setting all beta values to positive.
In cases where the beta value is negative:
1.- the risk alleles are exchanged - other 2.- The sign (- -> +) of the beta value is changed.
The results are saved in score_final.txt
R
ID | RISK_ALLELE | OTHER_ALLELE | RISK_SCORE | RISK_ALLELE_2 |
---|---|---|---|---|
1:110365045_G_A | A | G | 0.04020 | A |
1:172864224_A_G | G | A | -0.04088 | G |
1:183056222_T_C | C | T | -0.07060 | C |
1:201885446_G_T | T | G | -0.03869 | T |
1:205163798_G_A | A | G | -0.07142 | A |
1:222045446_G_T | T | G | 0.05670 | T |
R
#Change Risk_allele <-> Other_allele in the negative beta values cases:
for (i in 1:nrow(score)){
# iterate over the column nº4 ("RISK_SCORE") and check if <0
if(score[i,4]<0){
# replace the value with OTHER_ALLELE (column 3) in the new column RISK_ALLEL_2 (5)
score[i,5]<- score[i,3]
}
}
#Change beta value sign - to -> +
score$RISK_SCORE_2 <- as.numeric(gsub("-", "",score$RISK_SCORE ))
#Delete some columns
score[,2:4] <- NULL
ID | RISK_ALLELE_2 | RISK_SCORE_2 |
---|---|---|
1:110365045_G_A | A | 0.04020 |
1:172864224_A_G | A | 0.04088 |
1:183056222_T_C | T | 0.07060 |
1:201885446_G_T | G | 0.03869 |
1:205163798_G_A | G | 0.07142 |
1:222045446_G_T | T | 0.05670 |
R
2 PRS calculation:
LINUX Terminal
Result file:
- PRS.sscore
3 PRS score rescaling
The SCORE_AVG values are initially in a very small range and are
rescaled to larger values to enhance result interpretability. Using the
scale function -> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale
R
IID | NMISS_ALLELE_CT | ALLELE_DOSAGE_SUM | SCORE1_AVG | SCORE1_AVG_scale |
---|---|---|---|---|
A00002 | 398 | 184 | 0.0295023 | -0.45793566 |
A00005 | 398 | 187 | 0.0289993 | -0.82414216 |
A00011 | 398 | 188 | 0.0291383 | -0.72294394 |
A00012 | 398 | 176 | 0.0284477 | -1.22573163 |
A00014 | 398 | 186 | 0.0296143 | -0.37639465 |
A00016 | 398 | 188 | 0.0302292 | 0.07128005 |
R
STEP 5 - PRS results
Then and as a last step, the results are reviewed and plotted to verify that the calculated PRS really discriminates between CRC cases and controls.
- 1 The calculated scores are combined with the sample phenotypes (case or control).
- 2 A t.test is performed between cases and controls.
- 3 The scores are plotted based on the case-control status.
Assign the phenotype
Add phenotypes of interest
The phenotypic data of the target data are loaded.
R
ID | tipo_cancer |
---|---|
A00002 | Control |
A00005 | Control |
A00011 | Control |
A00012 | Control |
A00014 | Control |
A00016 | Control |
merge scores + phenotypes
Use of all CRC and all Controls selected in the 0_Data_sample_preparation.
R
IID | NMISS_ALLELE_CT | ALLELE_DOSAGE_SUM | SCORE1_AVG | SCORE1_AVG_scale | tipo_cancer |
---|---|---|---|---|---|
A00002 | 398 | 184 | 0.0295023 | -0.45793566 | Control |
A00005 | 398 | 187 | 0.0289993 | -0.82414216 | Control |
A00011 | 398 | 188 | 0.0291383 | -0.72294394 | Control |
A00012 | 398 | 176 | 0.0284477 | -1.22573163 | Control |
A00014 | 398 | 186 | 0.0296143 | -0.37639465 | Control |
A00016 | 398 | 188 | 0.0302292 | 0.07128005 | Control |
Statistics
A t.test is performed between cases and controls to check if there are differences between the calculated scores of these two groups.
R
# plots
ttest<-t.test(scores_samples$SCORE1_AVG_scale~scores_samples$tipo_cancer)
tpval<- t.test(scores_samples$SCORE1_AVG_scale~scores_samples$tipo_cancer)$p.value
tpval
## [1] 2.498608e-45
PRS Plots
Histogram
R
# Histogram:
histo <- ggplot(scores_samples, aes(x = SCORE1_AVG_scale))+
geom_histogram(aes(y=..density.. ,fill = tipo_cancer, color = tipo_cancer), alpha = 0.5, position = "identity") +
scale_fill_manual( values = c("steelblue4", "gray87")) +
scale_color_manual(values = c("steelblue4", "gray50"))+
theme_classic()+
xlab("Polygenic risk score (PRS)") +
ylab("Density")+
#labs(fill="xyz")+
labs(title = "Colorectal cancer", subtitle = "p <0.0001 ")+
theme(text = element_text(size = 16),
legend.position = "bottom") +
theme(legend.title=element_blank())
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Density plot
R
# Density plot
density <- ggplot(scores_samples, aes(x = SCORE1_AVG_scale, color = tipo_cancer))+
geom_density()+
scale_color_manual(values=c("#99B1C4", "gray50"))+
labs(title = "Risk score distribution",
#subtitle = "Plot of length by dose",
caption = paste0("t-test pval=",tpval))+
theme_minimal()+
theme(text = element_text(size = 20))
Boxplot
R
# Boxplot
boxplot <- ggplot(scores_samples, aes(x=tipo_cancer, y=SCORE1_AVG_scale, fill=tipo_cancer)) +
geom_boxplot()+
labs(x="", y = "Risk Score")+
scale_x_discrete(limits=c("Control", "Colorectal"))+
scale_fill_manual(values=c("#99B1C4", "gray87"))+
theme_classic()+
theme(text = element_text(size = 20), legend.position = "none")
qqplot
R
# qqplot
qwe1<-qqnorm(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer =="Control"],plot.it=FALSE)
qwe2<-qqnorm(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer=="Colorectal"],plot.it=FALSE)
plot(qwe1$x,qwe1$y,pch=19,las=1,main="",col="darkgrey",xlab="Theoretical Quantiles",ylab="Sample Quantiles")
qqline(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer== "Control"],col="darkgrey",lwd=2)
points(qwe2$x,qwe2$y,pch=19,col="#99B1C4")
qqline(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer== "Colorectal"],col="#99B1C4",lwd=2)
title("Risk score Q-Q Plot",sub=paste0("t-test pval=",tpval))
Extra information
Liftover
Transform data from a genome version to another genome version
Suppose our data is in hg19 and we want to impute using TOPMed panel as reference hg19 to hg38 https://github.com/sritchie73/liftOverPlink
Download the chain file that allows the mass conversion of coordinates from one assembly to another. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/
or see https://genviz.org/module-01-intro/0001/06/02/liftoverTools/
Credits
This document was created with Rmarkdown by Nuria Moragas and Victor Moreno
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), 28029 Madrid, Spain
- Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
- Colorectal Cancer Group, ONCOBELL Program, Institut de Recerca Biomedica de Bellvitge (IDIBELL), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
- Department of Clinical Sciences, Faculty of Medicine and health Sciences and Universitat de Barcelona Institute of Complex Systems (UBICS), University of Barcelona (UB), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
This work is licensed under CC BY-NC-SA 4.0