Polygenic Risk Score (PRS) Tutorial

OVERVIEW

The purpose of this tutorial is to furnish a step-by-step guide on the execution of a Polygenic Risk Score (PRS) analysis, accompanied by an introduction to fundamental genetic principles germane to this methodology.

The tutorial is divided into two sections, First, a series of basic concepts about genetics and PRS are introduced at a theoretical level. Secondly, the tutorial itself consists of 6 steps which are detailed below.

Step-by-step PRS analysis:

  1. STEP 0: Data selection + QC check
  2. STEP 1: SNPs Extraction
  3. STEP 2: SNPs Alignament check
  4. STEP 3: Quality control (QC)
  5. STEP 4: PRS calculation
  6. STEP 5: PRS resultsn

Software requirements

- R packages

R packages required:

R

library(readxl)
library(data.table)
library(ggplot2)


Required Data

To carry out this PRS analysis it is necessary two types of data. Here is the link to download the data used in the tutorial. If you want to use your own data, see what is necessary in the section STEP 0 - Data selection + QC check.

You can download it directly here: Download - 1_CRC_SNPs_list_205_Nature_2023

Attention: This file has an error in column “POS (GRCh37)” for some SNPs, the position was wrong, when compared to column “VARIANT”. The file in data/resources has been fixed.
The same file can be downlaoded from PSG catalog (https://www.pgscatalog.org/score/PGS003850/). However, this file has the same errors and doesn’t have the column VARIANT to help fixing it.


  • TARGET DATA. Imputed genotyping data from a population of interest. Simulated data based on GenRisk data will be used here. For reasons of space, only the data of the patients with the SNPs associated with CRC can be downloaded (document obtained in STEP 1) Download - GenRisk_all_CRC_SNPs_perm.vcf .

To be used from the section Step 1 - SNP extraction. 2 VCF files concatenation

  • METADATA. Phenotypic data of the target data. In this case which of the samples is controlled and which CRC. Metadata_GenRIsk -

Attention: Whether you wish to utilize your own data or existing datasets, it’s important to note that the data used should have already undergone a minimum quality control (QC) process, with hg37 as the reference genome. For the target data, it has been imputed using the 1000 Genomes reference.

For more information regarding the format of the data to be used or how to adapt see here

Info

This tutorial requires R and a linux terminal. You can use Rstudio or Visual Studio Code (with R plugins), that include a terminal.

Both in R and in terminal, first check that you are in the correct directory. In R remember to load above mentioned libraries.

You can clone our github repository to reproduce this practical. The exception is STEP 1, that requires real imputed genotype data. We canot share real data, but we provide fake data resulting from the pipeline in directort data/resources)

LINUX Terminal

# clone the github repository

git clone https://github.com/odap-ico/PRS_tutorial
cd PRS_tutorial

What’s a Polygenic Risk Score


A Polygenic Risk Score (PRS) is a single numerical metric that estimates an individual’s genetic risk for a particular disease or trait (phenotype). This score is calculated by summing the contributions of risk alleles carried by the individual, each weighted by its respective effect size as determined by a Genome-Wide Association Study (GWAS) specific to the phenotype in question.

In essence, the PRS encapsulates the combined impact of various genetic variants, predominantly Single Nucleotide Polymorphisms (SNPs), within an individual’s genome. A higher PRS indicates a greater genetic predisposition, while a lower score suggests a reduced risk, thereby facilitating the prediction of disease susceptibility or trait expression based on one’s genetic makeup. This approach offers valuable insights into the genetic architecture of complex traits and diseases.

Effect size - βi

In PRS, there exist two primary approaches:

  • Unweighted PRS, where each genetic variant is assigned a fixed weight of 1 (βi = 1).

  • Weighted PRS, which employs the actual effect sizes, typically represented as betas (βi) or Odds Ratios (OR), derived from GWAS specific to the phenotype in question.

The choice between these two approaches is contingent upon several key factors.

¿When to use an unweighted PRS (βi = 1)?

  • Your data have been used for the GWAS -> model overfitting
  • Poorly robust GWAS

PRS applications

  • Disease risk prediction
  • Screening and prevention.
  • Personalized treatment
  • Epidemiological Studies
  • Drug Development

Step-by-step PRS analysis

The step-by-step analysis structure that will be followed to perform the PRS analysis is as follows:

  1. STEP 0: Data selection + QC check. The process begins with the careful selection of genetic data (SNPs data and target data) relevant to the study, ensuring that it accomplishes general quality control standards commonly used in GWAS analyses.

  2. STEP 1: SNPs Extraction. Consists of the extraction of the SNPs associated with the study characteristic or disease from the total pool of SNPs in the target dataset where the PRS will be calculated.

  3. STEP 2: SNPs Alignament check. Discrepancies, such as missing SNPs, duplicates, variations in strand orientation, and more, can occur between the SNPs in the SNP dataset and the target data. These disparities may arise due to differences in platforms, imputation methods, or genome builds. Therefore, it is essential to carefully evaluate and resolve such differences whenever possible.

  4. STEP 3: Quality control (QC). Imputed genotyping data, QC is performed both before and after imputation, ensuring the use of high-quality data and the exclusion of poorly imputed variants.

  5. STEP 4: PRS calculation. Calculate the PRS by combining the genetic variants (SNPs) and their associated effect sizes in the target data.

  6. STEP 5: PRS results. This involves drawing meaningful insights from the calculated scores and making conclusions about genetic predisposition for the studied phenotype.

Other Basic Genetic Concepts

STEP 0 - Data selection + QC check

Below is detailed where to find, in which formats, what minimum information is required as well as what minimum QC the data must have passed.

SNPs Data Selection

The initial phase in the analysis of the Polygenic Risk Score (PRS) involves acquiring the base data, which comprises a list of SNPs associated with the specific trait or condition of interest.

These SNP datasets can be sourced from publicly available GWAS studies found in the literature or various genetic databases. Alternatively, researchers can conduct their own GWAS to generate the required summary statistics.

For the purpose of this tutorial, we will utilize SNP data associated with Colorectal Cancer (CRC) as our reference dataset.

SNPs data are a set of individual SNPs that have been identified as related to the characteristic or trait studied through a GWAS study. These data usually include the position of the SNP in the genome, the estimated genetic effect (such as odds ratio or beta), the allelic frequency, and other related statistics.


- Where to find:


- Minimum information required:

  • Chromosome nº
  • Position
  • Effect and non-effect allele
  • Effect size (betas or OR)


- QC check:

It is necessary to ensure that the list of SNPs selected has passed a minimum QC check. -> those applied to a GWAS

  • Genome build -> Check that the reference genome is the same between the SNP data and the target data. Otherwise, it is necessary to carry out a Liftover. Better perform on the data of the SNPs of interest.
  • Effect allele
  • Standard GWAS QC:
    • Genotyping rate >0,99
    • Remove SNPs with >5% missing data
    • Heterozygosity. Remove samples > mean(proportion of heterozygous sites ) ± 3 * sd(proportion of heterozygous sites)
    • Exclude individuals with sex discordance
    • Imputation information score (INFO score). > 0.3 – 0.8
    • Minor Allele Frequency (MAF). > 0.005 – 0,001
    • Linkage-disequilibrium (LD) score
    • Hardy-Weinberg equilibrium (HWE) P>1x10-6

Target Data

TARGET DATA means the genotyping data of a population already imputed. (Usually own data - collaborations)

This data can be in multiple formats some of the most common are detailed below. (Depending on where they are downloaded or with which the charge is made).

- Imputed genotype data formats:

  • Variant Call Format (VCF). e.g chr.{i}vcf.gz
  • PLINK 1.9 format [.bed] [.bim] [.fam]
  • BGEN format [.bgen], [.sample], [.bgen.bgi]
  • PLINK 2 format [.pgen], [.pvar], [.psam]

NOTE: Generally, imputation with 1000 genomes or TopMed results in vcf files separated by chromosomes. This tutorial starts with data in this format.

- QC check:

  • More than 100 individuals
  • Genome build
  • No sample overlap with SNPs GWAS samples
  • Standard GWAS For samples:
    • Individuals genotype rate >0.99
    • Remove sample with >10% missing data
    • Heterozygosity. Remove samples > mean(proportion of heterozygous sites ) ± 3 * sd(proportion of heterozygous sites)
    • Exclude individuals with sex discordance
    • Remove duplicates and relatedness samples. PI_HAT > 0.8 For SNPs:
    • Imputation information score (INFO score). > 0.3 – 0.8
    • Minor Allele Frequency (MAF). > 0.01 - 0.005
    • Linkage-disequilibrium (LD) score
    • Hardy-Weinberg Equilibrium (HWE) p > 1x10-6

ATTENTION! The quality control for this data is conducted both before and after imputation. Post-imputation quality control (QC) related to SNPs can be carried out either before initiating the Polygenic Risk Score (PRS) analysis or during the analysis. In this instance, SNP QC is performed after extracting the SNPs of interest from the Target Data. This approach is chosen because the dataset is smaller, making the process faster.

STEP 1 - SNPs Extraction

This section consists of two parts:

SNPs list preparation

Preparation of a list of SNPs associated with CRC for extraction in the target data. A .txt file with two columns:

  • Chromosome number
  • SNP position
Example .txt file with the identifiers of the SNPs of interest.
Example .txt file with the identifiers of the SNPs of interest.


ATTENTION! If the reference genome does not match the list of SNPs associated with the study trait and the target data to be used, a Liftover must be performed, see here.

In this case both the list of SNPs and the target date are in GRCh37.

- Load SNPs associated to CCR:

SNPs Data from Ceres Fernandez-Rozadilla et al. Nature 2023, saves as: .1_CRC_SNPs_list_205_Nature_2023.xlsx in resources folder.

R

# load data
CRC_SNP <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")

# The first 5 rows and first 7 columns are shown
CRC_SNP[c(1:5),c(1:7)]


CHR POS (GRCh37) VARIANT rsID EFFECT_ALLELE OTHER_ALLELE BETA
1 22503282 1:22503282_G/C rs2807367 C G -0.03790
1 22710877 1:22710877_G/C rs34963268 C G -0.06589
1 38461319 1:38461319_C/A rs61776719 A C -0.05610
1 55247852 1:55247852_A/G rs12143541 G A 0.08250
1 62673037 1:62673037_T/C rs7542665 C T 0.01850


- Cration of chr position .txt file

The CHR and POS columns (GRCh37) are saved in a .txt file.

R

# .txt file generation
SNP_file_CRC <- CRC_SNP[,c(1,2)]
head(SNP_file_CRC)


CHR POS (GRCh37)
1 22503282
1 22710877
1 38461319
1 55247852
1 62673037
1 71040166

- Save .txt file

R

# two avoid scientific annotation
SNP_file_CRC$`POS (GRCh37)` <- format(SNP_file_CRC$`POS (GRCh37)`, scientific=FALSE)

# create folder data/1_snps_extraction/ and save as .txt
dir.create("data/1_snps_extraction/", recursive=TRUE)
write.table(SNP_file_CRC, "data/1_snps_extraction/1_SNP_file_CRC.txt", quote = F, col.names = F, row.names = F, qmethod = "double", sep = "\t")

SNPs extraction

The aim of this step is to extract data on SNPs associated with colorectal cancer from GenRisk data.

It consists of 3 substeps:

- GenRisk imputed data

In this case, the GenRisk data already imputed by 1000 genomes will be used. Data imputed with TopMed or 1000Genomes servers return data in vcf.gz format:

GenRisk Imputed data files:
GenRisk Imputed data files:

ATTENTION! The data is substantial and cannot be shared. As a solution, the code for extracting CRC SNPs using the Linux bcftools tool, chromosome by chromosome, will be demonstrated without execution. Next, the process of merging individual files from each chromosome will be presented. This final outcome is what will be made available to proceed with the exercise.

In other words, the data to which you have access are the results of the VCF files concatenation section and are called: GenRisk_all_CRC_SNPs.vcf.gz

On the other hand, the shared data have been modified to ensure compliance with personal data protection regulations.

R

# this is specific to each setting

# set the directory containing the imputed genotype data.
data_dir <-Sys.getenv("IMPUTATION_1000G_data") 

# directory for SNP extraction
vcf_dir <- "data/1_snps_extraction" 

- 1 SNPs extraction:

LINUX Terminal

# Extraction of the SNPs in a loop for the different chromosomes. 

## bcftool view <directory/_chr*_vcf.gz> -R <directory/SNP_list.txt> -o <directory/output_file.vcf.gz>
### view -> allows filtering and viewing data contained in VCF files.
### -R -> Region. specific region of the genome
### -o -> Output. 


for i in {1..22}
do
bcftools view data_dir/GenRisk_chr${i}.vcf.gz -R data/1_snps_extraction/1_SNP_file_CRC.txt -o data/1_snps_extraction/GenRisk_chr${i}_CRC_SNPs.vcf.gz
done


#time 1 min 30 s


The application of this code generates new vcf files for each chromosome, only with the SNPs present in the file 1_SNP_file_CRC.txt. The following vcf files are generated:

CRC data files
CRC data files

- 2 VCF files concatenation:

Merge of GenRisk_chr${i}_CRC_SNPs.vcf.gz into a single file:

R

# Creation of a file concatenation list: 

chromList<-list.files(path = vcf_dir,  pattern="GenRisk_chr*")
write.table(chromList,file= paste0(vcf_dir,"/2_chromList.txt"),quote=FALSE,row.names = FALSE,col.names = FALSE)


This is the list of files to be concatenated that is created:


LINUX Terminal


# Concatenation:
bcftools concat -f data/1_snps_extraction/2_chromList.txt -Oz -o data/1_snps_extraction/GenRisk_all_CRC_SNPs_perm.vcf.gz

# time 6s

GenRisk_all_CRC_SNPs_perm.vcf.gz is the data that can be downloaded (folder data/resources) and worked from here.


NOTE: The GenRisk_chr${i}_CRC_SNPs.vcf.gz files created during SNP extraction are intermediate, and it is recommended to delete them later to avoid unnecessary space consumption

LINUX Terminal

# Remove the individual chromosome files*

rm data/1_snps_extraction/GenRisk_chr*_CRC_SNPs.vcf.gz


STEP 2 - SNPs Alignament check

Inconsistencies, including absent SNPs, duplications, variations in strand orientation, and other discrepancies, may exist between the SNP dataset and the target data. These variations could stem from divergences in platforms, imputation methods, or genome builds. Hence, it is crucial to thoroughly assess and address these distinctions whenever feasible.

These discrepancies can be:

Load data

R

# re-load original CRC SNPs list.
CRC_SNP <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")

R

# Load the data of the SNPs extracted from the target data:
pvar<-fread("data/1_snps_extraction/raw_data.pvar")

colnames(pvar)<- c("CHR",   "POS",  "ID"    ,"REF", "ALT",  "FILTER",   "INFO")
head(pvar)


CHR POS ID REF ALT FILTER INFO
1 22503282 1:22503282_G_C G C PASS AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=5515;AN=8590
1 22710877 1:22710877_G_C G C PASS AF=0.18103;MAF=0.18103;R2=0.96771;IMPUTED;AC=1590;AN=8590
1 38461319 1:38461319_C_A C A PASS AF=0.49037;MAF=0.49037;R2=0.79922;IMPUTED;AC=4458;AN=8590
1 55247852 1:55247852_A_G A G PASS AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1028;AN=8590
1 62673037 1:62673037_T_C T C PASS AF=0.66593;MAF=0.33407;R2=0.99843;ER2=0.96251;TYPED;AC=5727;AN=8590
1 71040166 1:71040166_G_T G T PASS AF=0.38298;MAF=0.38298;R2=0.80064;IMPUTED;ER2=0.84;TYPED;AC=3339;AN=8590


Missing SNPs

The selected SNPs may not all be available in the target dataset (Missing SNPs), which could be due to:

  • Different Genotyping / imputation platform
  • SNP have not passed QC (low quality)

Although it looks like all SNPs have been found (205) Let’s see if any are missing.

Find SNPs by chr:position

R

# Match variant format
CRC_SNP$id_pos <- paste0(CRC_SNP$CHR,":",CRC_SNP$`POS (GRCh37)`)
pvar$id_pos <- paste0(pvar$CHR,":",pvar$POS)


# SNPs not found
snpsNOTpvar<-CRC_SNP[!(CRC_SNP$id_pos %in% pvar$id_pos), ]


QUESTION How many variables were found to be missing in our data? Which one?

R

snpsNOTpvar


CHR POS (GRCh37) VARIANT rsID EFFECT_ALLELE OTHER_ALLELE BETA SE P REF PMID PMCID DOI PUBLISHED_SNP LD across all included populations id_pos
X 9763898 23:9763898_C/G rs2732875 C G 0.169 0.024 * 8.3E-12 * Dunlop MG, Nat Genet, 2012 22634755 PMC4747430 10.1038/ng.2293 rs5934683 R2:0.28; D’:0.94 X:9763898


1 variable have been identified as missing. The missing SNP is the one on the X chromosome, as sex chromosome data is not being used.


QUESTION What is the problem with the 205 SNPs extracted if we already know that one is missing?

The problem is that if the missing X chromosome SNP is subtracted from the initial 205 SNPs, theoretically, we would expect to extract 204 SNPs, indicating an extra SNP. There is a duplicated/multi-allelic/bialellic SNP?


Multi-allelic / Duplicated / Biallelic SNPs

Let’s check if there are any duplicate chr:position IDs:

R

table(duplicated(pvar$id_pos))
## 
## FALSE  TRUE 
##   204     1


Yes, there is a duplicated chromosome:position ID. Which one?

R

multi_all <- pvar[duplicated(pvar$POS) | duplicated(pvar$POS, fromLast = TRUE), ]
multi_all


CHR POS ID REF ALT FILTER INFO id_pos
12 64404555 12:64404555_T_A T A PASS AF=0.0005;MAF=0.0005;R2=0.56084;IMPUTED;AC=3;AN=8590 12:64404555
12 64404555 12:64404555_T_C T C PASS AF=0.46244;MAF=0.46244;R2=0.80188;IMPUTED;AC=3995;AN=8590 12:64404555


To know which one is the incorrect, it must be compared with the original listing.

R

# the format of the variable ID between tables is equalized. From chr:position_other_allele/effect_allele to chr:position_other_allele_effect_allele
CRC_SNP$ID <- gsub("/","_",CRC_SNP$VARIANT)

#Which is the correct?
id_multi <- multi_all[!multi_all$ID %in% CRC_SNP$ID,] 
id_multi


CHR POS ID REF ALT FILTER INFO id_pos
12 64404555 12:64404555_T_A T A PASS AF=0.0005;MAF=0.0005;R2=0.56084;IMPUTED;AC=3;AN=8590 12:64404555

R

#Save the incorrect SNP to delete in the QC step:
## Structure: CHROM  ID      REF     ALT
## ID = chr:pos_REF_ALT

id_multi_to_remove <- id_multi[,c(1,3:5)]
dir.create("data/2_SNPs_alignament_check")

write.table(id_multi_to_remove,"data/2_SNPs_alignament_check/multi_to_remove.txt", col.names = F, row.names = F, quote = F)


This incorrect SNP should be removed from the data. This process is carried out together with the quality control in STEP 3 - QC.


Mismatching SNPs

It is essential to verify whether the REF and ALT alleles in the target data match those in the SNP list. For example, in the target data, there may be SNPs with A/C alleles in the base data and G/T alleles in the target data. This discrepancy can occur when one dataset was genotyped with reference to the forward strand, and the other was genotyped with reference to the reverse strand.

SNPs that have mismatched alleles reported in the original SNP list and the target data can be resolved by strand-flipping the alleles to their complementary alleles. Most polygenic scoring software programs perform strand shifting automatically for resolvable SNPs and remove mismatched non-resolvable SNPs.

SNPs with mismatched alleles between the SNP list and the target data can be resolved by “strand-flipping” the alleles to their complementary pairs. Most polygenic scoring software programs automatically perform strand flipping for resolvable SNPs and exclude non-resolvable SNPs with mismatches.

PLINK2 does not do this automatically, if there is a mismatching SNP in the data, the “strand-flipping” must be done manually.


All SNPs are then verified to have the correct alleles, using ID = chr:position_other_allele_effect_allele.

R

## SNPs that have been extracted: 
dim(CRC_SNP[CRC_SNP$ID %in% pvar$ID,])
## [1] 204  17

R

## SNPs that have not been extracted: 
no_snps <- CRC_SNP[!CRC_SNP$ID %in% pvar$ID,]
no_snps


CHR POS (GRCh37) VARIANT rsID EFFECT_ALLELE OTHER_ALLELE BETA SE P REF PMID PMCID DOI PUBLISHED_SNP LD across all included populations id_pos ID
X 9763898 23:9763898_C/G rs2732875 C G 0.169 0.024 * 8.3E-12 * Dunlop MG, Nat Genet, 2012 22634755 PMC4747430 10.1038/ng.2293 rs5934683 R2:0.28; D’:0.94 X:9763898 23:9763898_C_G


It has been confirmed that all SNPs have the correct alleles, except for the SNP on the X chromosome


Ambiguous SNPs

Ambiguous (Palindromic) SNPs are those in which the allelic variants are complementary. In the context of PRS calculation, they are SNPs where the two allelic variants are symmetrical (e.g., A/T in the SNP list and T/A in the target data) and have similar frequencies, making it difficult to distinguish which variant is the effective one. If you have access to the effect allele frequency (EAF) data from the list of SNPs of interest, you can compare it with the EAF data in your dataset to determine which allele is the reference (REF). However, EAF data is rarely available in downloadable SNP lists.

In cases where EAF data is not available or when EAF is close to 0.5, these ambiguous SNPs should be excluded. In PRS calculations, SNPs with allelic frequencies close to 0.5 (between 0.4 and 0.6) are considered ambiguous and are typically excluded.




To check if there are ambiguous SNPs in the data, first load the EAF data from the target data obtained during the extraction of the SNPs and apply an EAF filter >0.4 | <0.6:

R

# load SNPs frequencyes  
freq<-read.table("data/1_snps_extraction/raw_data.afreq",header=F,sep="\t")
colnames(freq)<- c("CHROM", "ID",   "REF",  "ALT",  "ALT_FREQS",    "OBS_CT")
head(freq)


CHROM ID REF ALT ALT_FREQS OBS_CT
1 1:22503282_G_C G C 0.641909 8590
1 1:22710877_G_C G C 0.185099 8590
1 1:38461319_C_A C A 0.518859 8590
1 1:55247852_A_G A G 0.119674 8590
1 1:62673037_T_C T C 0.666705 8590
1 1:71040166_G_T G T 0.388591 8590


R

ambiguous_snps <- freq[freq$ALT_FREQS > 0.4 & freq$ALT_FREQS< 0.6 & 
                        ((freq$REF == "A" & freq$ALT == "T") | 
                         (freq$ALT == "T" & freq$REF == "A") | 
                         (freq$REF == "C" & freq$ALT == "G") | 
                         (freq$ALT == "G" & freq$REF == "C")), ]


QUESTION How many ambiguous SNPs were found?

R

ambiguous_snps


CHROM ID REF ALT ALT_FREQS OBS_CT
57 13 13:34093518_C_G C G 0.406170 8590
75 15 15:33009574_C_G C G 0.477532 8590
85 16 16:86338288_A_T A T 0.515600 8590

There are 3 ambiguous SNPs


Those SNPs are saved to a .txt file to be removed later.

R

# Save
write.table(ambiguous_snps, "data/2_SNPs_alignament_check/exclrsIDs_ambiguous.txt", sep = "\t", quote = FALSE, row.names = FALSE)


STEP 3 - Quality control (QC)

It is important to ensure that the SNPs chosen to calculate the PRS present a minimum of quality in the Target data. For this reason, the following points will be reviewed:

Any SNP that does not pass this QC will be removed in the final step of this section, Apply QC exclusions.

a) Imputation information score (INFO)

Imputation is a process used to predict or estimate genotypic information for unobserved or missing genetic variants (SNPs) in a dataset based on available genotype data and reference panels.

The imputation information (INFO score – r2) statistic is a measure of imputation quality which typically takes values between 0 and 1, where 0 indicates complete uncertainty and 1 represents complete certainty about the imputed genotype.

Variants with a low imputation information score (r^2 < 0.3 - 0.8) are often regarded as less reliable and are typically excluded from further analyses.

In this case a threshold of 0.4 is used.

R

# SNPs target date, already loaded. -> pvar

#pvar<-read.table("data/1_snps_extraction/raw_data.pvar",header=F,sep="\t")
#colnames(pvar)<- c("CHR",  "POS",  "ID"    ,"REF", "ALT",  "FILTER",   "INFO")
head(pvar)


CHR POS ID REF ALT FILTER INFO id_pos
1 22503282 1:22503282_G_C G C PASS AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=5515;AN=8590 1:22503282
1 22710877 1:22710877_G_C G C PASS AF=0.18103;MAF=0.18103;R2=0.96771;IMPUTED;AC=1590;AN=8590 1:22710877
1 38461319 1:38461319_C_A C A PASS AF=0.49037;MAF=0.49037;R2=0.79922;IMPUTED;AC=4458;AN=8590 1:38461319
1 55247852 1:55247852_A_G A G PASS AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1028;AN=8590 1:55247852
1 62673037 1:62673037_T_C T C PASS AF=0.66593;MAF=0.33407;R2=0.99843;ER2=0.96251;TYPED;AC=5727;AN=8590 1:62673037
1 71040166 1:71040166_G_T G T PASS AF=0.38298;MAF=0.38298;R2=0.80064;IMPUTED;ER2=0.84;TYPED;AC=3339;AN=8590 1:71040166


In pvar the data about INFO (r^2) is in the column called INFO where there is a lot of other data. R2 is extracted and stored in a new column called R2.

eg. INFO estructure AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=9067;AN=14206 or AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1687;AN=14206

The information to be extracted can be seen in bold.


R

# Create column R2
pvar$INFO2 <- sub("ER2=.*", "", pvar$INFO)
pvar$R2 <- as.numeric(sub('.*R2=([0-9.]+);.*', '\\1', pvar$INFO2))


QUESTION How many SNPs exhibit an imputation value (R²) lower than 0.4? Which are these snps?

R

# R2 >0.4
table(pvar$R2 >= 0.4)
## 
## FALSE  TRUE 
##     2   203


2 are the SNPs that have a low imputation score and must be eliminated.




R

# Extraction of SNPs with low INFO:

r2_04_remove <- pvar[pvar$R2 < 0.4,]
r2_04_remove


CHR POS ID REF ALT FILTER INFO id_pos INFO2 R2
1 245181421 1:245181421_T_C T C PASS AF=0.00087;MAF=0.00087;R2=0.06303;IMPUTED;AC=0;AN=8590 1:245181421 AF=0.00087;MAF=0.00087;R2=0.06303;IMPUTED;AC=0;AN=8590 0.06303
12 31594813 12:31594813_C_T C T PASS AF=0.00089;MAF=0.00089;R2=0.03916;IMPUTED;AC=0;AN=8590 12:31594813 AF=0.00089;MAF=0.00089;R2=0.03916;IMPUTED;AC=0;AN=8590 0.03916


These SNPs will be added to a .txt file called to_remove later, to be filtered in the last step of this QC.


b) Minor allele frequency (MAF)

Minor Allele Frequency (MAF) indicates the rarity of a genetic variant within a population. Variants with a MAF greater than 5% are considered common, those with a MAF between 1% and 5% are categorized as low frequency, and variants with a MAF less than 1% are considered rare.

In PRS analyses, it’s common practice to filter out SNPs with low frequency because they may not have enough statistical power to provide meaningful insights. In this case, we will apply a threshold of 0.005 directly to the Plink2 function using the –maf flag. This threshold helps ensure that only SNPs with a MAF greater than or equal to 0.005 are included in the analysis, focusing on more common variants for robust and reliable results.

c) Linkage Disequilibrium (LD)

Linkage disequilibrium (LD) measures the connection between closely located genetic variants that often share inheritance patterns because of their proximity, leading to a genetic association within a population.

During the construction of a PRS, variants in high LD will have been identified and removed by methods such as clumping / pruning. This is done to reduce redundancy and bias in the PRS analysis, as it is desired that the markers used be as informative as possible without duplicating information.

In order to verify this, or apply more stringent thresholds, the PLINK 2 command –indep-pairwise can be used to prune the data, resulting in a subset of variants with LD below a specified threshold.

LINUX Terminal

# Pruning / clumping to remove highly correlated SNPs:

# make folder to save QC results
mkdir -p data/3_QC

# remove redundant SNPs 

plink2 \
    --pfile data/1_snps_extraction/raw_data \
    --indep-pairwise 200 1 0.3 \
    --out data/3_QC/LD
      
      ## "200" refers to the distance in kilobases (kb) between the SNPs
      ## "1" refers to the maximum number of SNPs that can be included in a clumping group.
      ## "0.3" is the correlation threshold (R^2) used to determine if two SNPs are "clumpable"



After applying this formula, two files are generated:

  • LD.prune.out -> Contains SNPs to be removed due to linkage disequilibrium.
  • LD.prune.in -> Contains SNPs that do not exhibit linkage disequilibrium.


QUESTION How many SNPs exhibit Linkage Disequilibrium?


Let’s see whitch SNPs present a linkage desequilibrium:

R

ld_snps <- read.table("data/3_QC/LD.prune.out",header=F,sep="\t")
ld_snps <- ld_snps$V1
length(ld_snps) 
## [1] 2
ld_snps
## [1] "1:245181421_T_C" "12:31594813_C_T"


d) Hardy-Weinberg Equilibrium (HWE)

Hardy-Weinberg Equilibrium (HWE) is a principle stating that allele and genotype frequencies should remain constant in a stable, unchanging population. Deviations from HWE may suggest genotyping errors, leading to the exclusion of such variants in analyses.

The PLINK 2 command –hwe filters out variants deviating from HWE using a specified p-value threshold.

HWE was previously tested in the GenRisk data; however, it is now confirmed using a threshold of 1e-6:

LINUX Terminal

plink2 \
    --pfile data/1_snps_extraction/raw_data \
    --hwe 1e-6 \
    --write-snplist \
    --out data/3_QC/HWE_QC



No variant should be deleted fro HWE.

Apply QC exclusions

All SNPs that have not passed QC, as well as ambiguous ones, are saved to a file called to_remove.txt and removed from the Target Data using PLINK2.

to_remove.txt creation

In the PRS calculation function, Plink only accepts a list of SNPs to be excluded. Therefore, we need to combine the list of:

  • SNP extracted and not included in the list of interests. 1 SNP, save as multi_to_remove.txt
  • Ambiguous SNPs. 3 SNPs, save as exclrsIDs_ambiguous.txt
  • SNPs with low INFO. 2 SNPs, save as r2_04_remove (r object).
  • LD SNPs. 2 SNPs, save as ld_snps (r object)

This combined list is saved in a file named “to_remove.txt.”

R

# The incorrect SNP extracted load.
multi_remove <-read.table("data/2_SNPs_alignament_check/multi_to_remove.txt", colClasses = c("character"))
colnames(multi_remove) <- c("CHROM", "ID", "REF", "ALT")
multi_remove
##   CHROM              ID REF ALT
## 1    12 12:64404555_T_A   T   A


R

# The ID columns are selected
ambiguous_snps_ID <- ambiguous_snps$ID
r2_04_remove_ID <- r2_04_remove$ID
multi_remove_ID  <- multi_remove$ID


The various SNPs lists are cross-referenced to identify matches.

R

table(ambiguous_snps_ID == multi_remove_ID)
## 
## FALSE 
##     3
table(ambiguous_snps_ID %in% r2_04_remove_ID)
## 
## FALSE 
##     3
table(multi_remove_ID %in% r2_04_remove_ID)
## 
## FALSE 
##     1
table(ld_snps %in% multi_remove_ID)
## 
## FALSE 
##     2
table(ld_snps %in% r2_04_remove_ID)
## 
## TRUE 
##    2
table(ld_snps %in% ambiguous_snps_ID)
## 
## FALSE 
##     2


QUESTION Finally, how many SNPs have failed QC and will be removed?
There are 2 SNPs that present both LD and a low imputation quality score. So in total we have 6 SNPs that have not passed the QC and will be removed.


R

# Union of the tables
to_remove <- unique(c(ambiguous_snps_ID,multi_remove_ID,r2_04_remove_ID,ld_snps ))
to_remove
## [1] "13:34093518_C_G" "15:33009574_C_G" "16:86338288_A_T" "12:64404555_T_A"
## [5] "1:245181421_T_C" "12:31594813_C_T"


R

# Save
write.table(to_remove, "data/3_QC/to_remove.txt", quote = FALSE, row.names = FALSE, col.names = F)


Filtering of SNPs with PLINK2

The next step involves data filtering to exclude SNPs in the to_remove.txt. On the other hand, it is still pending to apply the filter based on Minor Allele Frequency (MAF) using the –maf flag.

LINUX Terminal


#Scrip aplicat:
plink2 \
--pfile data/1_snps_extraction/raw_data \
--exclude data/3_QC/to_remove.txt \
--maf 0.005 \
--write-snplist \
--make-pgen \
--out data/3_QC/dataQC



The outputs of the command above are:

  • dataQC.pgen, dataQC.psam, dataQC.pvar: Data after SNP QC (excpet HWE) in PLINK 2 binary format
  • dataQC.snplist: List of IDs of SNPs that passed QC with the specified thresholds (from --write-snplist)
QUESTION What is the total number of SNPs remaining?


199 SNPs

From the initial 205, the following have been removed:

-1 for being on the X sex chromosome
-3 for ambiguous
-2 for low imputation score
-0 for low MAF
-2 (idem INFO) LD


STEP 4 - PRS calculation

  • 1 Prepare score.txt
  • 2 PRS calculation
  • 3 PRS score rescaling

1 Prepare score.txt

Creating a file with 3 columns:

  • ID of the SNPs that have passed the QC
  • The RISK_ALLELE
  • Betas

Thus ensuring that the risk allele according to the beta value.

This process is carried out to ensure that the PRS are consistent and accurately reflect genetic risk. In some genetic analyses, beta values associated with a SNP can be positive or negative depending on their association with a particular disease or trait. By changing the sign of negative betas to make them positive, it ensures that all genetic contributions are positive. This makes it easier to interpret them in terms of cumulative genetic risk.


Load data

It is necessary both the data of the SNPs of the target data and the beta values of the original list of the SNPs of interest. A series of modifications are made to match column names and data structure.

The ID in the target data has this structure: chr:position_ref_alt - e.x: 1:110365045_G_A.

In the case of the original SNPs list with beta data (representing the weight of each SNP in the trait under study), the structure is as follows: chr:position_other_risk, where the risk allele is determined by the sign of the beta value. In other words:

  • For SNP 1:22503282_G_C with a positive beta value, the risk allele is C.
  • For SNP 1:22503282_G_C with a negative beta value, the risk allele is G

R

# Load Info SNPs resultants del QC
snps_qc<-read.table("data/3_QC/dataQC.pvar")
colnames(snps_qc) <- c("CHROM", "POS",  "ID", "REF", "ALT", "FILTER", "INFO")

#Transform ID to: chr:position_all_1:all_2 format
#snps_qc$ID <- gsub("_",":",snps_qc$ID)

#Keep ID and ALT / RISK_ALLELE
names(snps_qc)[names(snps_qc) == "ALT"] <- "RISK_ALLELE"
names(snps_qc)[names(snps_qc) == "REF"] <- "OTHER_ALLELE"
snps_qc <- snps_qc[,c("ID", "RISK_ALLELE","OTHER_ALLELE" )]
head(snps_qc) 


ID RISK_ALLELE OTHER_ALLELE
1:22503282_G_C C G
1:22710877_G_C C G
1:38461319_C_A A C
1:55247852_A_G G A
1:62673037_T_C C T
1:71040166_G_T T G


From the data of the initial SNPs list we only save the ID and the beta value:

R

# Load beta values.
betas <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")

#Transform ID to: chr:position_all_1:all_2 format -> 1:110365045_A_G
names(betas)[names(betas) == "VARIANT"] <- "ID"
names(betas)[names(betas) == "BETA"] <- "RISK_SCORE"
#betas$ID <- gsub("_",":",betas$ID)
betas$ID <- gsub("/","_",betas$ID)

betas <- betas[,c(3,7)]
head(betas)


ID RISK_SCORE
1:22503282_G_C -0.03790
1:22710877_G_C -0.06589
1:38461319_C_A -0.05610
1:55247852_A_G 0.08250
1:62673037_T_C 0.01850
1:71040166_G_T -0.04060


Merge information

It is checked that beta values are available for all SNPs and the two tables are joined according to the IDs.

R

# Checking

table(snps_qc$ID %in% betas$ID)
## 
## TRUE 
##  199
# Merge 
score <- merge(snps_qc, betas, by = "ID", all.x = TRUE)
head(score)


ID RISK_ALLELE OTHER_ALLELE RISK_SCORE
1:110365045_G_A A G 0.04020
1:172864224_A_G G A -0.04088
1:183056222_T_C C T -0.07060
1:201885446_G_T T G -0.03869
1:205163798_G_A A G -0.07142
1:222045446_G_T T G 0.05670


setting all beta values to positive

Modification of the score_modif.txt list with the aim of setting all beta values to positive.

In cases where the beta value is negative:

1.- the risk alleles are exchanged - other 2.- The sign (- -> +) of the beta value is changed.

The results are saved in score_final.txt

R

#Create new RISK_ALLELE column to be able to modify it according to the beta value
score$RISK_ALLELE_2 <- score$RISK_ALLELE
head(score)


ID RISK_ALLELE OTHER_ALLELE RISK_SCORE RISK_ALLELE_2
1:110365045_G_A A G 0.04020 A
1:172864224_A_G G A -0.04088 G
1:183056222_T_C C T -0.07060 C
1:201885446_G_T T G -0.03869 T
1:205163798_G_A A G -0.07142 A
1:222045446_G_T T G 0.05670 T


R

#Change Risk_allele <-> Other_allele in the negative beta values cases:
for (i in 1:nrow(score)){
      
    # iterate over the column nº4 ("RISK_SCORE")  and check if <0
    if(score[i,4]<0){
           
        # replace the value with OTHER_ALLELE (column 3) in the new column RISK_ALLEL_2 (5)
        score[i,5]<- score[i,3]
        }
}

#Change beta value sign - to -> +
score$RISK_SCORE_2 <- as.numeric(gsub("-", "",score$RISK_SCORE ))


#Delete some columns
score[,2:4] <- NULL
head(score)
</div>
<br>
ID RISK_ALLELE_2 RISK_SCORE_2
1:110365045_G_A A 0.04020
1:172864224_A_G A 0.04088
1:183056222_T_C T 0.07060
1:201885446_G_T G 0.03869
1:205163798_G_A G 0.07142
1:222045446_G_T T 0.05670


R

#Save the new score doc as score_final.txt 

dir.create("data/4_PRS_calculation",recursive = TRUE)

write.table(score,"data/4_PRS_calculation/score_final.txt", col.names = F, quote = F, row.names = F)


2 PRS calculation:

LINUX Terminal

#PRS calculation 

plink2 \
--pfile data/1_snps_extraction/raw_data \
--extract data/3_QC/dataQC.snplist \
--score data/4_PRS_calculation/score_final.txt no-mean-imputation \
--out data/4_PRS_calculation/PRS



Result file:

  • PRS.sscore

3 PRS score rescaling

The SCORE_AVG values are initially in a very small range and are rescaled to larger values to enhance result interpretability. Using the scale function -> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale

R

# Load results:
scores <- read.table("data/4_PRS_calculation/PRS.sscore")
colnames(scores)<- c("IID", "NMISS_ALLELE_CT",  "ALLELE_DOSAGE_SUM",    "SCORE1_AVG")

#Rescale the scores
scores$SCORE1_AVG_scale <- scale(scores$SCORE1_AVG, center = T, scale = T)
head(scores)


IID NMISS_ALLELE_CT ALLELE_DOSAGE_SUM SCORE1_AVG SCORE1_AVG_scale
A00002 398 184 0.0295023 -0.45793566
A00005 398 187 0.0289993 -0.82414216
A00011 398 188 0.0291383 -0.72294394
A00012 398 176 0.0284477 -1.22573163
A00014 398 186 0.0296143 -0.37639465
A00016 398 188 0.0302292 0.07128005


R


STEP 5 - PRS results

Then and as a last step, the results are reviewed and plotted to verify that the calculated PRS really discriminates between CRC cases and controls.

Assign the phenotype

Add phenotypes of interest

The phenotypic data of the target data are loaded.


R

# Samples to extract
samples<-read.table("data/resources/GenRisk_ImputedData_CRC_Samples.txt",header=TRUE,stringsAsFactors=FALSE,sep="\t")
head(samples)


ID tipo_cancer
A00002 Control
A00005 Control
A00011 Control
A00012 Control
A00014 Control
A00016 Control


R

table(samples$tipo_cancer)
## 
## Colorectal    Control 
##       1431       2864


merge scores + phenotypes

Use of all CRC and all Controls selected in the 0_Data_sample_preparation.


R

#Unir columnes samples i PRS.score per les ID de pacients
scores_samples<-merge(scores,samples,by.x="IID",by.y="ID",all=FALSE)
head(scores_samples)


IID NMISS_ALLELE_CT ALLELE_DOSAGE_SUM SCORE1_AVG SCORE1_AVG_scale tipo_cancer
A00002 398 184 0.0295023 -0.45793566 Control
A00005 398 187 0.0289993 -0.82414216 Control
A00011 398 188 0.0291383 -0.72294394 Control
A00012 398 176 0.0284477 -1.22573163 Control
A00014 398 186 0.0296143 -0.37639465 Control
A00016 398 188 0.0302292 0.07128005 Control



Statistics

A t.test is performed between cases and controls to check if there are differences between the calculated scores of these two groups.

R

# plots
ttest<-t.test(scores_samples$SCORE1_AVG_scale~scores_samples$tipo_cancer)
tpval<- t.test(scores_samples$SCORE1_AVG_scale~scores_samples$tipo_cancer)$p.value
tpval  
## [1] 2.498608e-45


PRS Plots

Histogram

R

# Histogram:
histo <- ggplot(scores_samples, aes(x = SCORE1_AVG_scale))+ 
  geom_histogram(aes(y=..density.. ,fill = tipo_cancer, color = tipo_cancer), alpha = 0.5, position = "identity") +
  scale_fill_manual( values = c("steelblue4", "gray87")) +
   scale_color_manual(values = c("steelblue4", "gray50"))+
   theme_classic()+
  xlab("Polygenic risk score (PRS)") + 
  ylab("Density")+
 
  #labs(fill="xyz")+
   labs(title = "Colorectal cancer", subtitle = "p <0.0001 ")+
   theme(text = element_text(size = 16),
         legend.position = "bottom") +
theme(legend.title=element_blank())
histo


## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Density plot

R

# Density plot
density <- ggplot(scores_samples, aes(x = SCORE1_AVG_scale, color = tipo_cancer))+ 
  geom_density()+
  scale_color_manual(values=c("#99B1C4", "gray50"))+
   labs(title = "Risk score distribution",
              #subtitle = "Plot of length by dose",
              caption = paste0("t-test pval=",tpval))+
   theme_minimal()+
  theme(text = element_text(size = 20)) 
density



Boxplot

R

# Boxplot
boxplot <- ggplot(scores_samples, aes(x=tipo_cancer, y=SCORE1_AVG_scale, fill=tipo_cancer)) +
  geom_boxplot()+
  labs(x="", y = "Risk Score")+
  scale_x_discrete(limits=c("Control", "Colorectal"))+
  scale_fill_manual(values=c("#99B1C4", "gray87"))+
  theme_classic()+
  theme(text = element_text(size = 20), legend.position = "none") 
boxplot



qqplot

R

# qqplot
qwe1<-qqnorm(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer =="Control"],plot.it=FALSE)
qwe2<-qqnorm(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer=="Colorectal"],plot.it=FALSE)

plot(qwe1$x,qwe1$y,pch=19,las=1,main="",col="darkgrey",xlab="Theoretical Quantiles",ylab="Sample Quantiles")
qqline(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer== "Control"],col="darkgrey",lwd=2)
points(qwe2$x,qwe2$y,pch=19,col="#99B1C4")
qqline(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer== "Colorectal"],col="#99B1C4",lwd=2)
title("Risk score Q-Q Plot",sub=paste0("t-test pval=",tpval))



Extra information

Liftover

Transform data from a genome version to another genome version

Suppose our data is in hg19 and we want to impute using TOPMed panel as reference hg19 to hg38 https://github.com/sritchie73/liftOverPlink

Download the chain file that allows the mass conversion of coordinates from one assembly to another. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/

liftOverPlink.py 
-m plinkFile.map 
-p plinkFile.ped 
-o plinkFile_lifted 
-c hg19ToHg38.over.chain.gz 

or see https://genviz.org/module-01-intro/0001/06/02/liftoverTools/


Credits

This document was created with Rmarkdown by Nuria Moragas and Victor Moreno

  • Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), 28029 Madrid, Spain
  • Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
  • Colorectal Cancer Group, ONCOBELL Program, Institut de Recerca Biomedica de Bellvitge (IDIBELL), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
  • Department of Clinical Sciences, Faculty of Medicine and health Sciences and Universitat de Barcelona Institute of Complex Systems (UBICS), University of Barcelona (UB), L’Hospitalet de Llobregat, 08908 Barcelona, Spain

This work is licensed under CC BY-NC-SA 4.0