Polygenic Risk Score (PRS) Tutorial

OVERVIEW

The purpose of this tutorial is to furnish a step-by-step guide on the execution of a Polygenic Risk Score (PRS) analysis, accompanied by an introduction to fundamental genetic principles germane to this methodology.

The tutorial is divided into two sections, First, a series of basic concepts about genetics and PRS are introduced at a theoretical level. Secondly, the tutorial itself consists of 6 steps which are detailed below.

Step-by-step PRS analysis:

STEP 0: Data selection + QC check
STEP 1: SNPs Extraction
STEP 2: SNPs Alignament check
STEP 3: Quality control (QC)
STEP 4: PRS calculation
STEP 5: PRS resultsn

Software requirements

Linux terminal access with the following tools installed:
- PLINK 2 - https://www.cog-genomics.org/plink/2.0/
- bcftools - http://www.htslib.org/download/
R

- R packages

R packages required:

library(readxl)
library(data.table)
library(ggplot2)

Required Data

To carry out this PRS analysis it is necessary two types of data. Here is the link to download the data used in the tutorial. If you want to use your own data, see what is necessary in the section STEP 0 - Data selection + QC check.

SNPs DATA. List of SNPs associated with a characteristic of interest. This tutorial will use a list of 205 SNPs associated with colon cancer (CRC) from the article: Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries

You can download it directly here: Download - 1_CRC_SNPs_list_205_Nature_2023

Attention: This file has an error in column “POS (GRCh37)” for some SNPs, the position was wrong, when compared to column “VARIANT”. The file in data/resources has been fixed.
The same file can be downlaoded from PSG catalog (https://www.pgscatalog.org/score/PGS003850/). However, this file has the same errors and doesn’t have the column VARIANT to help fixing it.

TARGET DATA. Imputed genotyping data from a population of interest. Simulated data based on GenRisk data will be used here. For reasons of space, only the data of the patients with the SNPs associated with CRC can be downloaded (document obtained in STEP 1) Download - GenRisk_all_CRC_SNPs_perm.vcf .

To be used from the section Step 1 - SNP extraction. 2 VCF files concatenation

METADATA. Phenotypic data of the target data. In this case which of the samples is controlled and which CRC. Metadata_GenRIsk -

Attention: Whether you wish to utilize your own data or existing datasets, it’s important to note that the data used should have already undergone a minimum quality control (QC) process, with hg37 as the reference genome. For the target data, it has been imputed using the 1000 Genomes reference.

For more information regarding the format of the data to be used or how to adapt see here

Info

This tutorial requires R and a linux terminal. You can use Rstudio or Visual Studio Code (with R plugins), that include a terminal.

Both in R and in terminal, first check that you are in the correct directory. In R remember to load above mentioned libraries.

You can clone our github repository to reproduce this practical. The exception is STEP 1, that requires real imputed genotype data. We canot share real data, but we provide fake data resulting from the pipeline in directort data/resources)

LINUX Terminal

# clone the github repository

git clone https://github.com/odap-ico/PRS_tutorial
cd PRS_tutorial

What’s a Polygenic Risk Score

A Polygenic Risk Score (PRS) is a single numerical metric that estimates an individual’s genetic risk for a particular disease or trait (phenotype). This score is calculated by summing the contributions of risk alleles carried by the individual, each weighted by its respective effect size as determined by a Genome-Wide Association Study (GWAS) specific to the phenotype in question.

In essence, the PRS encapsulates the combined impact of various genetic variants, predominantly Single Nucleotide Polymorphisms (SNPs), within an individual’s genome. A higher PRS indicates a greater genetic predisposition, while a lower score suggests a reduced risk, thereby facilitating the prediction of disease susceptibility or trait expression based on one’s genetic makeup. This approach offers valuable insights into the genetic architecture of complex traits and diseases.

Effect size - βi

In PRS, there exist two primary approaches:

Unweighted PRS, where each genetic variant is assigned a fixed weight of 1 (βi = 1).
Weighted PRS, which employs the actual effect sizes, typically represented as betas (βi) or Odds Ratios (OR), derived from GWAS specific to the phenotype in question.

The choice between these two approaches is contingent upon several key factors.

¿When to use an unweighted PRS (βi = 1)?

Your data have been used for the GWAS -> model overfitting
Poorly robust GWAS

PRS applications

Disease risk prediction
Screening and prevention.
Personalized treatment
Epidemiological Studies
Drug Development

Step-by-step PRS analysis

The step-by-step analysis structure that will be followed to perform the PRS analysis is as follows:

STEP 0: Data selection + QC check. The process begins with the careful selection of genetic data (SNPs data and target data) relevant to the study, ensuring that it accomplishes general quality control standards commonly used in GWAS analyses.
STEP 1: SNPs Extraction. Consists of the extraction of the SNPs associated with the study characteristic or disease from the total pool of SNPs in the target dataset where the PRS will be calculated.
STEP 2: SNPs Alignament check. Discrepancies, such as missing SNPs, duplicates, variations in strand orientation, and more, can occur between the SNPs in the SNP dataset and the target data. These disparities may arise due to differences in platforms, imputation methods, or genome builds. Therefore, it is essential to carefully evaluate and resolve such differences whenever possible.
STEP 3: Quality control (QC). Imputed genotyping data, QC is performed both before and after imputation, ensuring the use of high-quality data and the exclusion of poorly imputed variants.
STEP 4: PRS calculation. Calculate the PRS by combining the genetic variants (SNPs) and their associated effect sizes in the target data.
STEP 5: PRS results. This involves drawing meaningful insights from the calculated scores and making conclusions about genetic predisposition for the studied phenotype.

Other Basic Genetic Concepts

STEP 0 - Data selection + QC check

Below is detailed where to find, in which formats, what minimum information is required as well as what minimum QC the data must have passed.

SNPs Data Selection

The initial phase in the analysis of the Polygenic Risk Score (PRS) involves acquiring the base data, which comprises a list of SNPs associated with the specific trait or condition of interest.

These SNP datasets can be sourced from publicly available GWAS studies found in the literature or various genetic databases. Alternatively, researchers can conduct their own GWAS to generate the required summary statistics.

For the purpose of this tutorial, we will utilize SNP data associated with Colorectal Cancer (CRC) as our reference dataset.

SNPs data are a set of individual SNPs that have been identified as related to the characteristic or trait studied through a GWAS study. These data usually include the position of the SNP in the genome, the estimated genetic effect (such as odds ratio or beta), the allelic frequency, and other related statistics.

- Where to find:

Polygenic Score (PGS) Catalog. https://www.pgscatalog.org/
GWAS Catalog. https://www.ebi.ac.uk/gwas/
Bibliography
Calculate your own GWAS

- Minimum information required:

Chromosome nº
Position
Effect and non-effect allele
Effect size (betas or OR)

- QC check:

It is necessary to ensure that the list of SNPs selected has passed a minimum QC check. -> those applied to a GWAS

Genome build -> Check that the reference genome is the same between the SNP data and the target data. Otherwise, it is necessary to carry out a Liftover. Better perform on the data of the SNPs of interest.
Effect allele
Standard GWAS QC:
- Genotyping rate >0,99
- Remove SNPs with >5% missing data
- Heterozygosity. Remove samples > mean(proportion of heterozygous sites ) ± 3 * sd(proportion of heterozygous sites)
- Exclude individuals with sex discordance
- Imputation information score (INFO score). > 0.3 – 0.8
- Minor Allele Frequency (MAF). > 0.005 – 0,001
- Linkage-disequilibrium (LD) score
- Hardy-Weinberg equilibrium (HWE) P>1x10-6

Target Data

TARGET DATA means the genotyping data of a population already imputed. (Usually own data - collaborations)

This data can be in multiple formats some of the most common are detailed below. (Depending on where they are downloaded or with which the charge is made).

- Imputed genotype data formats:

Variant Call Format (VCF). e.g chr.{i}vcf.gz
PLINK 1.9 format [.bed] [.bim] [.fam]
BGEN format [.bgen], [.sample], [.bgen.bgi]
PLINK 2 format [.pgen], [.pvar], [.psam]

NOTE: Generally, imputation with 1000 genomes or TopMed results in vcf files separated by chromosomes. This tutorial starts with data in this format.

- QC check:

More than 100 individuals
Genome build
No sample overlap with SNPs GWAS samples
Standard GWAS For samples:
- Individuals genotype rate >0.99
- Remove sample with >10% missing data
- Heterozygosity. Remove samples > mean(proportion of heterozygous sites ) ± 3 * sd(proportion of heterozygous sites)
- Exclude individuals with sex discordance
- Remove duplicates and relatedness samples. PI_HAT > 0.8 For SNPs:
- Imputation information score (INFO score). > 0.3 – 0.8
- Minor Allele Frequency (MAF). > 0.01 - 0.005
- Linkage-disequilibrium (LD) score
- Hardy-Weinberg Equilibrium (HWE) p > 1x10-6

ATTENTION! The quality control for this data is conducted both before and after imputation. Post-imputation quality control (QC) related to SNPs can be carried out either before initiating the Polygenic Risk Score (PRS) analysis or during the analysis. In this instance, SNP QC is performed after extracting the SNPs of interest from the Target Data. This approach is chosen because the dataset is smaller, making the process faster.

STEP 1 - SNPs Extraction

This section consists of two parts:

SNPs list preparation
SNPs extraction

SNPs list preparation

Preparation of a list of SNPs associated with CRC for extraction in the target data. A .txt file with two columns:

Chromosome number
SNP position

Example .txt file with the identifiers of the SNPs of interest.

ATTENTION! If the reference genome does not match the list of SNPs associated with the study trait and the target data to be used, a Liftover must be performed, see here.

In this case both the list of SNPs and the target date are in GRCh37.

- Load SNPs associated to CCR:

SNPs Data from Ceres Fernandez-Rozadilla et al. Nature 2023, saves as: .1_CRC_SNPs_list_205_Nature_2023.xlsx in resources folder.

# load data
CRC_SNP <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")

# The first 5 rows and first 7 columns are shown
CRC_SNP[c(1:5),c(1:7)]

CHR	POS (GRCh37)	VARIANT	rsID	EFFECT_ALLELE	OTHER_ALLELE	BETA
1	22503282	1:22503282_G/C	rs2807367	C	G	-0.03790
1	22710877	1:22710877_G/C	rs34963268	C	G	-0.06589
1	38461319	1:38461319_C/A	rs61776719	A	C	-0.05610
1	55247852	1:55247852_A/G	rs12143541	G	A	0.08250
1	62673037	1:62673037_T/C	rs7542665	C	T	0.01850

- Cration of chr position .txt file

The CHR and POS columns (GRCh37) are saved in a .txt file.

# .txt file generation
SNP_file_CRC <- CRC_SNP[,c(1,2)]
head(SNP_file_CRC)

CHR	POS (GRCh37)
1	22503282
1	22710877
1	38461319
1	55247852
1	62673037
1	71040166

- Save .txt file

# two avoid scientific annotation
SNP_file_CRC$`POS (GRCh37)` <- format(SNP_file_CRC$`POS (GRCh37)`, scientific=FALSE)

# create folder data/1_snps_extraction/ and save as .txt
dir.create("data/1_snps_extraction/", recursive=TRUE)
write.table(SNP_file_CRC, "data/1_snps_extraction/1_SNP_file_CRC.txt", quote = F, col.names = F, row.names = F, qmethod = "double", sep = "\t")

SNPs extraction

The aim of this step is to extract data on SNPs associated with colorectal cancer from GenRisk data.

It consists of 3 substeps:

1 SNP Extraction
2 VCF files concatenation
3 Conversion to PLINK2 format

- GenRisk imputed data

In this case, the GenRisk data already imputed by 1000 genomes will be used. Data imputed with TopMed or 1000Genomes servers return data in vcf.gz format:

GenRisk Imputed data files:

ATTENTION! The data is substantial and cannot be shared. As a solution, the code for extracting CRC SNPs using the Linux bcftools tool, chromosome by chromosome, will be demonstrated without execution. Next, the process of merging individual files from each chromosome will be presented. This final outcome is what will be made available to proceed with the exercise.

In other words, the data to which you have access are the results of the VCF files concatenation section and are called: GenRisk_all_CRC_SNPs.vcf.gz

On the other hand, the shared data have been modified to ensure compliance with personal data protection regulations.

# this is specific to each setting

# set the directory containing the imputed genotype data.
data_dir <-Sys.getenv("IMPUTATION_1000G_data") 

# directory for SNP extraction
vcf_dir <- "data/1_snps_extraction"

- 1 SNPs extraction:

LINUX Terminal

# Extraction of the SNPs in a loop for the different chromosomes. 

## bcftool view <directory/_chr*_vcf.gz> -R <directory/SNP_list.txt> -o <directory/output_file.vcf.gz>
### view -> allows filtering and viewing data contained in VCF files.
### -R -> Region. specific region of the genome
### -o -> Output. 


for i in {1..22}
do
bcftools view data_dir/GenRisk_chr${i}.vcf.gz -R data/1_snps_extraction/1_SNP_file_CRC.txt -o data/1_snps_extraction/GenRisk_chr${i}_CRC_SNPs.vcf.gz
done


#time 1 min 30 s

The application of this code generates new vcf files for each chromosome, only with the SNPs present in the file 1_SNP_file_CRC.txt. The following vcf files are generated:

CRC data files

- 2 VCF files concatenation:

Merge of GenRisk_chr${i}_CRC_SNPs.vcf.gz into a single file:

# Creation of a file concatenation list: 

chromList<-list.files(path = vcf_dir,  pattern="GenRisk_chr*")
write.table(chromList,file= paste0(vcf_dir,"/2_chromList.txt"),quote=FALSE,row.names = FALSE,col.names = FALSE)

This is the list of files to be concatenated that is created:

LINUX Terminal


# Concatenation:
bcftools concat -f data/1_snps_extraction/2_chromList.txt -Oz -o data/1_snps_extraction/GenRisk_all_CRC_SNPs_perm.vcf.gz

# time 6s

GenRisk_all_CRC_SNPs_perm.vcf.gz is the data that can be downloaded (folder data/resources) and worked from here.

NOTE: The GenRisk_chr${i}_CRC_SNPs.vcf.gz files created during SNP extraction are intermediate, and it is recommended to delete them later to avoid unnecessary space consumption

LINUX Terminal

# Remove the individual chromosome files*

rm data/1_snps_extraction/GenRisk_chr*_CRC_SNPs.vcf.gz

- 3 Conversion to PLINK2 format

Plink2 is a widely utilized tool for genetic data analysis. Converting VCF files to Plink2 enables the utilization of various features, including the creation of genetic relationship matrices and the calculation of polygenic risk scores (PRS). This conversion not only streamlines data compatibility with other tools but also enhances execution speed. Plink2, being equipped with PRS calculation functionality, further facilitates genetic analysis processes.

PLINK2 format

PLINK2 conversion

Below conversion to PLINK 2 format, relabel variant IDs to ensure they have a unique identifier, and print a frequency report necessary to identify ambidextrous SNPs (SNPs QC, next steps).

Resulting files: - PLINK 2 format (.pgen, .pvar, .psam). - Obtain allele frequency report (.afreq).

LINUX Terminal


plink2 --vcf data/resources/GenRisk_all_CRC_SNPs_perm.vcf.gz \
--new-id-max-allele-len 50 \
--set-all-var-ids @:#_\$r_\$a \
--freq \
--make-pgen \
--out data/1_snps_extraction/raw_data

Code notes:

–vcf.gz -> input file
–new-id-max-allele-len 50 -> set the maximum length for new allele identifiers (IDs) in the output format. By default it is 14 and it may not be enough
–set-all-var-ids @:#$r$a -> set SNPs id -> chr:pos_ref_alt
–freq -> compute allele frequency and output to .afreq file
–make-pgen -> generate files in plink 2 format

Results: When applying this caddy, 4 files are generated:

raw_data.freq
raw_data.log
raw_data.pgen
raw_data.psam
raw_data.pvar

QUESTION 1 How many variables have been detected? How many samples?

205 variables and 4295 samples

STEP 2 - SNPs Alignament check

Inconsistencies, including absent SNPs, duplications, variations in strand orientation, and other discrepancies, may exist between the SNP dataset and the target data. These variations could stem from divergences in platforms, imputation methods, or genome builds. Hence, it is crucial to thoroughly assess and address these distinctions whenever feasible.

These discrepancies can be:

Missing SNPs
Multi-allelic / biallelic / Duplicated SNPs
Mismatching SNPs
Ambiguous SNPs

Load data

# re-load original CRC SNPs list.
CRC_SNP <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")

# Load the data of the SNPs extracted from the target data:
pvar<-fread("data/1_snps_extraction/raw_data.pvar")

colnames(pvar)<- c("CHR",   "POS",  "ID"    ,"REF", "ALT",  "FILTER",   "INFO")
head(pvar)

CHR	POS	ID	REF	ALT	FILTER	INFO
1	22503282	1:22503282_G_C	G	C	PASS	AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=5515;AN=8590
1	22710877	1:22710877_G_C	G	C	PASS	AF=0.18103;MAF=0.18103;R2=0.96771;IMPUTED;AC=1590;AN=8590
1	38461319	1:38461319_C_A	C	A	PASS	AF=0.49037;MAF=0.49037;R2=0.79922;IMPUTED;AC=4458;AN=8590
1	55247852	1:55247852_A_G	A	G	PASS	AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1028;AN=8590
1	62673037	1:62673037_T_C	T	C	PASS	AF=0.66593;MAF=0.33407;R2=0.99843;ER2=0.96251;TYPED;AC=5727;AN=8590
1	71040166	1:71040166_G_T	G	T	PASS	AF=0.38298;MAF=0.38298;R2=0.80064;IMPUTED;ER2=0.84;TYPED;AC=3339;AN=8590

Missing SNPs

The selected SNPs may not all be available in the target dataset (Missing SNPs), which could be due to:

Different Genotyping / imputation platform
SNP have not passed QC (low quality)

Although it looks like all SNPs have been found (205) Let’s see if any are missing.

Find SNPs by chr:position

# Match variant format
CRC_SNP$id_pos <- paste0(CRC_SNP$CHR,":",CRC_SNP$`POS (GRCh37)`)
pvar$id_pos <- paste0(pvar$CHR,":",pvar$POS)


# SNPs not found
snpsNOTpvar<-CRC_SNP[!(CRC_SNP$id_pos %in% pvar$id_pos), ]

QUESTION How many variables were found to be missing in our data? Which one?

snpsNOTpvar

CHR	POS (GRCh37)	VARIANT	rsID	EFFECT_ALLELE	OTHER_ALLELE	BETA	SE	P	REF	PMID	PMCID	DOI	PUBLISHED_SNP	LD across all included populations	id_pos
X	9763898	23:9763898_C/G	rs2732875	C	G	0.169	0.024 *	8.3E-12 *	Dunlop MG, Nat Genet, 2012	22634755	PMC4747430	10.1038/ng.2293	rs5934683	R2:0.28; D’:0.94	X:9763898

1 variable have been identified as missing. The missing SNP is the one on the X chromosome, as sex chromosome data is not being used.

QUESTION What is the problem with the 205 SNPs extracted if we already know that one is missing?

The problem is that if the missing X chromosome SNP is subtracted from the initial 205 SNPs, theoretically, we would expect to extract 204 SNPs, indicating an extra SNP. There is a duplicated/multi-allelic/bialellic SNP?

Multi-allelic / Duplicated / Biallelic SNPs

Let’s check if there are any duplicate chr:position IDs:

table(duplicated(pvar$id_pos))

## 
## FALSE  TRUE 
##   204     1

Yes, there is a duplicated chromosome:position ID. Which one?

multi_all <- pvar[duplicated(pvar$POS) | duplicated(pvar$POS, fromLast = TRUE), ]
multi_all

CHR	POS	ID	REF	ALT	FILTER	INFO	id_pos
12	64404555	12:64404555_T_A	T	A	PASS	AF=0.0005;MAF=0.0005;R2=0.56084;IMPUTED;AC=3;AN=8590	12:64404555
12	64404555	12:64404555_T_C	T	C	PASS	AF=0.46244;MAF=0.46244;R2=0.80188;IMPUTED;AC=3995;AN=8590	12:64404555

To know which one is the incorrect, it must be compared with the original listing.

# the format of the variable ID between tables is equalized. From chr:position_other_allele/effect_allele to chr:position_other_allele_effect_allele
CRC_SNP$ID <- gsub("/","_",CRC_SNP$VARIANT)

#Which is the correct?
id_multi <- multi_all[!multi_all$ID %in% CRC_SNP$ID,] 
id_multi

CHR	POS	ID	REF	ALT	FILTER	INFO	id_pos
12	64404555	12:64404555_T_A	T	A	PASS	AF=0.0005;MAF=0.0005;R2=0.56084;IMPUTED;AC=3;AN=8590	12:64404555

#Save the incorrect SNP to delete in the QC step:
## Structure: CHROM  ID      REF     ALT
## ID = chr:pos_REF_ALT

id_multi_to_remove <- id_multi[,c(1,3:5)]

dir.create("data/2_SNPs_alignament_check")

write.table(id_multi_to_remove,"data/2_SNPs_alignament_check/multi_to_remove.txt", col.names = F, row.names = F, quote = F)

This incorrect SNP should be removed from the data. This process is carried out together with the quality control in STEP 3 - QC.

Mismatching SNPs

It is essential to verify whether the REF and ALT alleles in the target data match those in the SNP list. For example, in the target data, there may be SNPs with A/C alleles in the base data and G/T alleles in the target data. This discrepancy can occur when one dataset was genotyped with reference to the forward strand, and the other was genotyped with reference to the reverse strand.

SNPs that have mismatched alleles reported in the original SNP list and the target data can be resolved by strand-flipping the alleles to their complementary alleles. Most polygenic scoring software programs perform strand shifting automatically for resolvable SNPs and remove mismatched non-resolvable SNPs.

SNPs with mismatched alleles between the SNP list and the target data can be resolved by “strand-flipping” the alleles to their complementary pairs. Most polygenic scoring software programs automatically perform strand flipping for resolvable SNPs and exclude non-resolvable SNPs with mismatches.

PLINK2 does not do this automatically, if there is a mismatching SNP in the data, the “strand-flipping” must be done manually.

All SNPs are then verified to have the correct alleles, using ID = chr:position_other_allele_effect_allele.

## SNPs that have been extracted: 
dim(CRC_SNP[CRC_SNP$ID %in% pvar$ID,])

## [1] 204  17

## SNPs that have not been extracted: 
no_snps <- CRC_SNP[!CRC_SNP$ID %in% pvar$ID,]
no_snps

CHR	POS (GRCh37)	VARIANT	rsID	EFFECT_ALLELE	OTHER_ALLELE	BETA	SE	P	REF	PMID	PMCID	DOI	PUBLISHED_SNP	LD across all included populations	id_pos	ID
X	9763898	23:9763898_C/G	rs2732875	C	G	0.169	0.024 *	8.3E-12 *	Dunlop MG, Nat Genet, 2012	22634755	PMC4747430	10.1038/ng.2293	rs5934683	R2:0.28; D’:0.94	X:9763898	23:9763898_C_G

It has been confirmed that all SNPs have the correct alleles, except for the SNP on the X chromosome

Ambiguous SNPs

Ambiguous (Palindromic) SNPs are those in which the allelic variants are complementary. In the context of PRS calculation, they are SNPs where the two allelic variants are symmetrical (e.g., A/T in the SNP list and T/A in the target data) and have similar frequencies, making it difficult to distinguish which variant is the effective one. If you have access to the effect allele frequency (EAF) data from the list of SNPs of interest, you can compare it with the EAF data in your dataset to determine which allele is the reference (REF). However, EAF data is rarely available in downloadable SNP lists.

In cases where EAF data is not available or when EAF is close to 0.5, these ambiguous SNPs should be excluded. In PRS calculations, SNPs with allelic frequencies close to 0.5 (between 0.4 and 0.6) are considered ambiguous and are typically excluded.

To check if there are ambiguous SNPs in the data, first load the EAF data from the target data obtained during the extraction of the SNPs and apply an EAF filter >0.4 | <0.6:

# load SNPs frequencyes  
freq<-read.table("data/1_snps_extraction/raw_data.afreq",header=F,sep="\t")
colnames(freq)<- c("CHROM", "ID",   "REF",  "ALT",  "ALT_FREQS",    "OBS_CT")
head(freq)

CHROM	ID	REF	ALT	ALT_FREQS	OBS_CT
1	1:22503282_G_C	G	C	0.641909	8590
1	1:22710877_G_C	G	C	0.185099	8590
1	1:38461319_C_A	C	A	0.518859	8590
1	1:55247852_A_G	A	G	0.119674	8590
1	1:62673037_T_C	T	C	0.666705	8590
1	1:71040166_G_T	G	T	0.388591	8590

ambiguous_snps <- freq[freq$ALT_FREQS > 0.4 & freq$ALT_FREQS< 0.6 & 
                        ((freq$REF == "A" & freq$ALT == "T") | 
                         (freq$ALT == "T" & freq$REF == "A") | 
                         (freq$REF == "C" & freq$ALT == "G") | 
                         (freq$ALT == "G" & freq$REF == "C")), ]

QUESTION How many ambiguous SNPs were found?

ambiguous_snps

	CHROM	ID	REF	ALT	ALT_FREQS	OBS_CT
57	13	13:34093518_C_G	C	G	0.406170	8590
75	15	15:33009574_C_G	C	G	0.477532	8590
85	16	16:86338288_A_T	A	T	0.515600	8590

There are 3 ambiguous SNPs

Those SNPs are saved to a .txt file to be removed later.

# Save
write.table(ambiguous_snps, "data/2_SNPs_alignament_check/exclrsIDs_ambiguous.txt", sep = "\t", quote = FALSE, row.names = FALSE)

STEP 3 - Quality control (QC)

It is important to ensure that the SNPs chosen to calculate the PRS present a minimum of quality in the Target data. For this reason, the following points will be reviewed:

1. Imputation information score (INFO)
1. Minor allele frequency (MAF)
1. Linkage Disequilibrium (LD)
1. Hardy-Weinberg Equilibrium (HWE)

Any SNP that does not pass this QC will be removed in the final step of this section, Apply QC exclusions.

a) Imputation information score (INFO)

Imputation is a process used to predict or estimate genotypic information for unobserved or missing genetic variants (SNPs) in a dataset based on available genotype data and reference panels.

The imputation information (INFO score – r2) statistic is a measure of imputation quality which typically takes values between 0 and 1, where 0 indicates complete uncertainty and 1 represents complete certainty about the imputed genotype.

Variants with a low imputation information score (r^2 < 0.3 - 0.8) are often regarded as less reliable and are typically excluded from further analyses.

In this case a threshold of 0.4 is used.

# SNPs target date, already loaded. -> pvar

#pvar<-read.table("data/1_snps_extraction/raw_data.pvar",header=F,sep="\t")
#colnames(pvar)<- c("CHR",  "POS",  "ID"    ,"REF", "ALT",  "FILTER",   "INFO")

head(pvar)

CHR	POS	ID	REF	ALT	FILTER	INFO	id_pos
1	22503282	1:22503282_G_C	G	C	PASS	AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=5515;AN=8590	1:22503282
1	22710877	1:22710877_G_C	G	C	PASS	AF=0.18103;MAF=0.18103;R2=0.96771;IMPUTED;AC=1590;AN=8590	1:22710877
1	38461319	1:38461319_C_A	C	A	PASS	AF=0.49037;MAF=0.49037;R2=0.79922;IMPUTED;AC=4458;AN=8590	1:38461319
1	55247852	1:55247852_A_G	A	G	PASS	AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1028;AN=8590	1:55247852
1	62673037	1:62673037_T_C	T	C	PASS	AF=0.66593;MAF=0.33407;R2=0.99843;ER2=0.96251;TYPED;AC=5727;AN=8590	1:62673037
1	71040166	1:71040166_G_T	G	T	PASS	AF=0.38298;MAF=0.38298;R2=0.80064;IMPUTED;ER2=0.84;TYPED;AC=3339;AN=8590	1:71040166

In pvar the data about INFO (r^2) is in the column called INFO where there is a lot of other data. R2 is extracted and stored in a new column called R2.

eg. INFO estructure AF=0.63412;MAF=0.36588;R2=0.94997;IMPUTED;AC=9067;AN=14206 or AF=0.12046;MAF=0.12046;R2=0.87593;IMPUTED;ER2=0.83497;TYPED;AC=1687;AN=14206

The information to be extracted can be seen in bold.

# Create column R2
pvar$INFO2 <- sub("ER2=.*", "", pvar$INFO)
pvar$R2 <- as.numeric(sub('.*R2=([0-9.]+);.*', '\\1', pvar$INFO2))

QUESTION How many SNPs exhibit an imputation value (R²) lower than 0.4? Which are these snps?

# R2 >0.4
table(pvar$R2 >= 0.4)

## 
## FALSE  TRUE 
##     2   203

2 are the SNPs that have a low imputation score and must be eliminated.

# Extraction of SNPs with low INFO:

r2_04_remove <- pvar[pvar$R2 < 0.4,]
r2_04_remove

CHR	POS	ID	REF	ALT	FILTER	INFO	id_pos	INFO2	R2
1	245181421	1:245181421_T_C	T	C	PASS	AF=0.00087;MAF=0.00087;R2=0.06303;IMPUTED;AC=0;AN=8590	1:245181421	AF=0.00087;MAF=0.00087;R2=0.06303;IMPUTED;AC=0;AN=8590	0.06303
12	31594813	12:31594813_C_T	C	T	PASS	AF=0.00089;MAF=0.00089;R2=0.03916;IMPUTED;AC=0;AN=8590	12:31594813	AF=0.00089;MAF=0.00089;R2=0.03916;IMPUTED;AC=0;AN=8590	0.03916

These SNPs will be added to a .txt file called to_remove later, to be filtered in the last step of this QC.

b) Minor allele frequency (MAF)

Minor Allele Frequency (MAF) indicates the rarity of a genetic variant within a population. Variants with a MAF greater than 5% are considered common, those with a MAF between 1% and 5% are categorized as low frequency, and variants with a MAF less than 1% are considered rare.

In PRS analyses, it’s common practice to filter out SNPs with low frequency because they may not have enough statistical power to provide meaningful insights. In this case, we will apply a threshold of 0.005 directly to the Plink2 function using the –maf flag. This threshold helps ensure that only SNPs with a MAF greater than or equal to 0.005 are included in the analysis, focusing on more common variants for robust and reliable results.

c) Linkage Disequilibrium (LD)

Linkage disequilibrium (LD) measures the connection between closely located genetic variants that often share inheritance patterns because of their proximity, leading to a genetic association within a population.

During the construction of a PRS, variants in high LD will have been identified and removed by methods such as clumping / pruning. This is done to reduce redundancy and bias in the PRS analysis, as it is desired that the markers used be as informative as possible without duplicating information.

In order to verify this, or apply more stringent thresholds, the PLINK 2 command –indep-pairwise can be used to prune the data, resulting in a subset of variants with LD below a specified threshold.

LINUX Terminal

# Pruning / clumping to remove highly correlated SNPs:

# make folder to save QC results
mkdir -p data/3_QC

# remove redundant SNPs 

plink2 \
    --pfile data/1_snps_extraction/raw_data \
    --indep-pairwise 200 1 0.3 \
    --out data/3_QC/LD
      
      ## "200" refers to the distance in kilobases (kb) between the SNPs
      ## "1" refers to the maximum number of SNPs that can be included in a clumping group.
      ## "0.3" is the correlation threshold (R^2) used to determine if two SNPs are "clumpable"

After applying this formula, two files are generated:

LD.prune.out -> Contains SNPs to be removed due to linkage disequilibrium.
LD.prune.in -> Contains SNPs that do not exhibit linkage disequilibrium.

QUESTION How many SNPs exhibit Linkage Disequilibrium?

Let’s see whitch SNPs present a linkage desequilibrium:

ld_snps <- read.table("data/3_QC/LD.prune.out",header=F,sep="\t")
ld_snps <- ld_snps$V1
length(ld_snps)

## [1] 2

ld_snps

## [1] "1:245181421_T_C" "12:31594813_C_T"

d) Hardy-Weinberg Equilibrium (HWE)

Hardy-Weinberg Equilibrium (HWE) is a principle stating that allele and genotype frequencies should remain constant in a stable, unchanging population. Deviations from HWE may suggest genotyping errors, leading to the exclusion of such variants in analyses.

The PLINK 2 command –hwe filters out variants deviating from HWE using a specified p-value threshold.

HWE was previously tested in the GenRisk data; however, it is now confirmed using a threshold of 1e-6:

LINUX Terminal

plink2 \
    --pfile data/1_snps_extraction/raw_data \
    --hwe 1e-6 \
    --write-snplist \
    --out data/3_QC/HWE_QC

No variant should be deleted fro HWE.

Apply QC exclusions

All SNPs that have not passed QC, as well as ambiguous ones, are saved to a file called to_remove.txt and removed from the Target Data using PLINK2.

to_remove.txt creation

In the PRS calculation function, Plink only accepts a list of SNPs to be excluded. Therefore, we need to combine the list of:

SNP extracted and not included in the list of interests. 1 SNP, save as multi_to_remove.txt
Ambiguous SNPs. 3 SNPs, save as exclrsIDs_ambiguous.txt
SNPs with low INFO. 2 SNPs, save as r2_04_remove (r object).
LD SNPs. 2 SNPs, save as ld_snps (r object)

This combined list is saved in a file named “to_remove.txt.”

# The incorrect SNP extracted load.
multi_remove <-read.table("data/2_SNPs_alignament_check/multi_to_remove.txt", colClasses = c("character"))
colnames(multi_remove) <- c("CHROM", "ID", "REF", "ALT")
multi_remove

##   CHROM              ID REF ALT
## 1    12 12:64404555_T_A   T   A

# The ID columns are selected
ambiguous_snps_ID <- ambiguous_snps$ID
r2_04_remove_ID <- r2_04_remove$ID
multi_remove_ID  <- multi_remove$ID

The various SNPs lists are cross-referenced to identify matches.

table(ambiguous_snps_ID == multi_remove_ID)

## 
## FALSE 
##     3

table(ambiguous_snps_ID %in% r2_04_remove_ID)

## 
## FALSE 
##     3

table(multi_remove_ID %in% r2_04_remove_ID)

## 
## FALSE 
##     1

table(ld_snps %in% multi_remove_ID)

## 
## FALSE 
##     2

table(ld_snps %in% r2_04_remove_ID)

## 
## TRUE 
##    2

table(ld_snps %in% ambiguous_snps_ID)

## 
## FALSE 
##     2

QUESTION Finally, how many SNPs have failed QC and will be removed?

There are 2 SNPs that present both LD and a low imputation quality score. So in total we have 6 SNPs that have not passed the QC and will be removed.

# Union of the tables
to_remove <- unique(c(ambiguous_snps_ID,multi_remove_ID,r2_04_remove_ID,ld_snps ))
to_remove

## [1] "13:34093518_C_G" "15:33009574_C_G" "16:86338288_A_T" "12:64404555_T_A"
## [5] "1:245181421_T_C" "12:31594813_C_T"

# Save
write.table(to_remove, "data/3_QC/to_remove.txt", quote = FALSE, row.names = FALSE, col.names = F)

Filtering of SNPs with PLINK2

The next step involves data filtering to exclude SNPs in the to_remove.txt. On the other hand, it is still pending to apply the filter based on Minor Allele Frequency (MAF) using the –maf flag.

LINUX Terminal


#Scrip aplicat:
plink2 \
--pfile data/1_snps_extraction/raw_data \
--exclude data/3_QC/to_remove.txt \
--maf 0.005 \
--write-snplist \
--make-pgen \
--out data/3_QC/dataQC

The outputs of the command above are:

dataQC.pgen, dataQC.psam, dataQC.pvar: Data after SNP QC (excpet HWE) in PLINK 2 binary format
dataQC.snplist: List of IDs of SNPs that passed QC with the specified thresholds (from --write-snplist)

QUESTION What is the total number of SNPs remaining?

199 SNPs

From the initial 205, the following have been removed:

-1 for being on the X sex chromosome
-3 for ambiguous
-2 for low imputation score
-0 for low MAF
-2 (idem INFO) LD

STEP 4 - PRS calculation

1 Prepare score.txt
2 PRS calculation
3 PRS score rescaling

1 Prepare score.txt

Creating a file with 3 columns:

ID of the SNPs that have passed the QC
The RISK_ALLELE
Betas

Thus ensuring that the risk allele according to the beta value.

This process is carried out to ensure that the PRS are consistent and accurately reflect genetic risk. In some genetic analyses, beta values associated with a SNP can be positive or negative depending on their association with a particular disease or trait. By changing the sign of negative betas to make them positive, it ensures that all genetic contributions are positive. This makes it easier to interpret them in terms of cumulative genetic risk.

Load data

It is necessary both the data of the SNPs of the target data and the beta values of the original list of the SNPs of interest. A series of modifications are made to match column names and data structure.

The ID in the target data has this structure: chr:position_ref_alt - e.x: 1:110365045_G_A.

In the case of the original SNPs list with beta data (representing the weight of each SNP in the trait under study), the structure is as follows: chr:position_other_risk, where the risk allele is determined by the sign of the beta value. In other words:

For SNP 1:22503282_G_C with a positive beta value, the risk allele is C.
For SNP 1:22503282_G_C with a negative beta value, the risk allele is G

# Load Info SNPs resultants del QC
snps_qc<-read.table("data/3_QC/dataQC.pvar")
colnames(snps_qc) <- c("CHROM", "POS",  "ID", "REF", "ALT", "FILTER", "INFO")

#Transform ID to: chr:position_all_1:all_2 format
#snps_qc$ID <- gsub("_",":",snps_qc$ID)

#Keep ID and ALT / RISK_ALLELE
names(snps_qc)[names(snps_qc) == "ALT"] <- "RISK_ALLELE"
names(snps_qc)[names(snps_qc) == "REF"] <- "OTHER_ALLELE"
snps_qc <- snps_qc[,c("ID", "RISK_ALLELE","OTHER_ALLELE" )]

head(snps_qc)

ID	RISK_ALLELE	OTHER_ALLELE
1:22503282_G_C	C	G
1:22710877_G_C	C	G
1:38461319_C_A	A	C
1:55247852_A_G	G	A
1:62673037_T_C	C	T
1:71040166_G_T	T	G

From the data of the initial SNPs list we only save the ID and the beta value:

# Load beta values.
betas <- read_excel("data/resources/1_CRC_SNPs_list_205_Nature_2023.xlsx")

#Transform ID to: chr:position_all_1:all_2 format -> 1:110365045_A_G
names(betas)[names(betas) == "VARIANT"] <- "ID"
names(betas)[names(betas) == "BETA"] <- "RISK_SCORE"
#betas$ID <- gsub("_",":",betas$ID)
betas$ID <- gsub("/","_",betas$ID)

betas <- betas[,c(3,7)]

head(betas)

ID	RISK_SCORE
1:22503282_G_C	-0.03790
1:22710877_G_C	-0.06589
1:38461319_C_A	-0.05610
1:55247852_A_G	0.08250
1:62673037_T_C	0.01850
1:71040166_G_T	-0.04060

Merge information

It is checked that beta values are available for all SNPs and the two tables are joined according to the IDs.

# Checking

table(snps_qc$ID %in% betas$ID)

## 
## TRUE 
##  199

# Merge 
score <- merge(snps_qc, betas, by = "ID", all.x = TRUE)
head(score)

ID	RISK_ALLELE	OTHER_ALLELE	RISK_SCORE
1:110365045_G_A	A	G	0.04020
1:172864224_A_G	G	A	-0.04088
1:183056222_T_C	C	T	-0.07060
1:201885446_G_T	T	G	-0.03869
1:205163798_G_A	A	G	-0.07142
1:222045446_G_T	T	G	0.05670

setting all beta values to positive

Modification of the score_modif.txt list with the aim of setting all beta values to positive.

In cases where the beta value is negative:

1.- the risk alleles are exchanged - other 2.- The sign (- -> +) of the beta value is changed.

The results are saved in score_final.txt

#Create new RISK_ALLELE column to be able to modify it according to the beta value
score$RISK_ALLELE_2 <- score$RISK_ALLELE
head(score)

ID	RISK_ALLELE	OTHER_ALLELE	RISK_SCORE	RISK_ALLELE_2
1:110365045_G_A	A	G	0.04020	A
1:172864224_A_G	G	A	-0.04088	G
1:183056222_T_C	C	T	-0.07060	C
1:201885446_G_T	T	G	-0.03869	T
1:205163798_G_A	A	G	-0.07142	A
1:222045446_G_T	T	G	0.05670	T

#Change Risk_allele <-> Other_allele in the negative beta values cases:
for (i in 1:nrow(score)){
      
    # iterate over the column nº4 ("RISK_SCORE")  and check if <0
    if(score[i,4]<0){
           
        # replace the value with OTHER_ALLELE (column 3) in the new column RISK_ALLEL_2 (5)
        score[i,5]<- score[i,3]
        }
}

#Change beta value sign - to -> +
score$RISK_SCORE_2 <- as.numeric(gsub("-", "",score$RISK_SCORE ))


#Delete some columns
score[,2:4] <- NULL

head(score)
</div>
<br>

ID	RISK_ALLELE_2	RISK_SCORE_2
1:110365045_G_A	A	0.04020
1:172864224_A_G	A	0.04088
1:183056222_T_C	T	0.07060
1:201885446_G_T	G	0.03869
1:205163798_G_A	G	0.07142
1:222045446_G_T	T	0.05670

#Save the new score doc as score_final.txt 

dir.create("data/4_PRS_calculation",recursive = TRUE)

write.table(score,"data/4_PRS_calculation/score_final.txt", col.names = F, quote = F, row.names = F)

2 PRS calculation:

LINUX Terminal

#PRS calculation 

plink2 \
--pfile data/1_snps_extraction/raw_data \
--extract data/3_QC/dataQC.snplist \
--score data/4_PRS_calculation/score_final.txt no-mean-imputation \
--out data/4_PRS_calculation/PRS

Result file:

PRS.sscore

3 PRS score rescaling

The SCORE_AVG values are initially in a very small range and are rescaled to larger values to enhance result interpretability. Using the scale function -> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale

# Load results:
scores <- read.table("data/4_PRS_calculation/PRS.sscore")
colnames(scores)<- c("IID", "NMISS_ALLELE_CT",  "ALLELE_DOSAGE_SUM",    "SCORE1_AVG")

#Rescale the scores
scores$SCORE1_AVG_scale <- scale(scores$SCORE1_AVG, center = T, scale = T)

head(scores)

IID	NMISS_ALLELE_CT	ALLELE_DOSAGE_SUM	SCORE1_AVG	SCORE1_AVG_scale
A00002	398	184	0.0295023	-0.45793566
A00005	398	187	0.0289993	-0.82414216
A00011	398	188	0.0291383	-0.72294394
A00012	398	176	0.0284477	-1.22573163
A00014	398	186	0.0296143	-0.37639465
A00016	398	188	0.0302292	0.07128005

STEP 5 - PRS results

Then and as a last step, the results are reviewed and plotted to verify that the calculated PRS really discriminates between CRC cases and controls.

1 The calculated scores are combined with the sample phenotypes (case or control).
2 A t.test is performed between cases and controls.
3 The scores are plotted based on the case-control status.
- Histogram
- Density plot
- Boxplot
- qqplot

Assign the phenotype

Add phenotypes of interest

The phenotypic data of the target data are loaded.

# Samples to extract
samples<-read.table("data/resources/GenRisk_ImputedData_CRC_Samples.txt",header=TRUE,stringsAsFactors=FALSE,sep="\t")

head(samples)

ID	tipo_cancer
A00002	Control
A00005	Control
A00011	Control
A00012	Control
A00014	Control
A00016	Control

table(samples$tipo_cancer)

## 
## Colorectal    Control 
##       1431       2864

merge scores + phenotypes

Use of all CRC and all Controls selected in the 0_Data_sample_preparation.

#Unir columnes samples i PRS.score per les ID de pacients
scores_samples<-merge(scores,samples,by.x="IID",by.y="ID",all=FALSE)

head(scores_samples)

IID	NMISS_ALLELE_CT	ALLELE_DOSAGE_SUM	SCORE1_AVG	SCORE1_AVG_scale	tipo_cancer
A00002	398	184	0.0295023	-0.45793566	Control
A00005	398	187	0.0289993	-0.82414216	Control
A00011	398	188	0.0291383	-0.72294394	Control
A00012	398	176	0.0284477	-1.22573163	Control
A00014	398	186	0.0296143	-0.37639465	Control
A00016	398	188	0.0302292	0.07128005	Control

Statistics

A t.test is performed between cases and controls to check if there are differences between the calculated scores of these two groups.

# plots
ttest<-t.test(scores_samples$SCORE1_AVG_scale~scores_samples$tipo_cancer)
tpval<- t.test(scores_samples$SCORE1_AVG_scale~scores_samples$tipo_cancer)$p.value
tpval

## [1] 2.498608e-45

PRS Plots

Histogram

# Histogram:
histo <- ggplot(scores_samples, aes(x = SCORE1_AVG_scale))+ 
  geom_histogram(aes(y=..density.. ,fill = tipo_cancer, color = tipo_cancer), alpha = 0.5, position = "identity") +
  scale_fill_manual( values = c("steelblue4", "gray87")) +
   scale_color_manual(values = c("steelblue4", "gray50"))+
   theme_classic()+
  xlab("Polygenic risk score (PRS)") + 
  ylab("Density")+
 
  #labs(fill="xyz")+
   labs(title = "Colorectal cancer", subtitle = "p <0.0001 ")+
   theme(text = element_text(size = 16),
         legend.position = "bottom") +
theme(legend.title=element_blank())

histo

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density plot

# Density plot
density <- ggplot(scores_samples, aes(x = SCORE1_AVG_scale, color = tipo_cancer))+ 
  geom_density()+
  scale_color_manual(values=c("#99B1C4", "gray50"))+
   labs(title = "Risk score distribution",
              #subtitle = "Plot of length by dose",
              caption = paste0("t-test pval=",tpval))+
   theme_minimal()+
  theme(text = element_text(size = 20))

density

Boxplot

# Boxplot
boxplot <- ggplot(scores_samples, aes(x=tipo_cancer, y=SCORE1_AVG_scale, fill=tipo_cancer)) +
  geom_boxplot()+
  labs(x="", y = "Risk Score")+
  scale_x_discrete(limits=c("Control", "Colorectal"))+
  scale_fill_manual(values=c("#99B1C4", "gray87"))+
  theme_classic()+
  theme(text = element_text(size = 20), legend.position = "none")

boxplot

qqplot

# qqplot
qwe1<-qqnorm(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer =="Control"],plot.it=FALSE)
qwe2<-qqnorm(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer=="Colorectal"],plot.it=FALSE)

plot(qwe1$x,qwe1$y,pch=19,las=1,main="",col="darkgrey",xlab="Theoretical Quantiles",ylab="Sample Quantiles")
qqline(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer== "Control"],col="darkgrey",lwd=2)
points(qwe2$x,qwe2$y,pch=19,col="#99B1C4")
qqline(scores_samples$SCORE1_AVG_scale[scores_samples$tipo_cancer== "Colorectal"],col="#99B1C4",lwd=2)
title("Risk score Q-Q Plot",sub=paste0("t-test pval=",tpval))

Extra information

Liftover

Transform data from a genome version to another genome version

Suppose our data is in hg19 and we want to impute using TOPMed panel as reference hg19 to hg38 https://github.com/sritchie73/liftOverPlink

Download the chain file that allows the mass conversion of coordinates from one assembly to another. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/

liftOverPlink.py 
-m plinkFile.map 
-p plinkFile.ped 
-o plinkFile_lifted 
-c hg19ToHg38.over.chain.gz

or see https://genviz.org/module-01-intro/0001/06/02/liftoverTools/

Credits

This document was created with Rmarkdown by Nuria Moragas and Victor Moreno

Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), 28029 Madrid, Spain
Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
Colorectal Cancer Group, ONCOBELL Program, Institut de Recerca Biomedica de Bellvitge (IDIBELL), L’Hospitalet de Llobregat, 08908 Barcelona, Spain
Department of Clinical Sciences, Faculty of Medicine and health Sciences and Universitat de Barcelona Institute of Complex Systems (UBICS), University of Barcelona (UB), L’Hospitalet de Llobregat, 08908 Barcelona, Spain

This work is licensed under CC BY-NC-SA 4.0