The knowledge of complete rather than fragmental haplotypes is critical in many areas of genetics studies and will advance personalized medicine including imputation of low-frequent variants, characterization of variant-disease associations, prediction of the clinical outcomes in tranplantation, drug response, and susceptibility to diseases, etc.
Population-based computational phasing methods cannot infer high-resolution haplotypes at chromosome-length and are not able to phase de novo mutations and rare variants, whereas existing experimental phasing methods usually require specialized equipment to separate single chromosomes, which is time-consuming and expensive. Whole-chromosome haplotype phasing with single diploid cells, either require a large number of sequencing libraries or need the assistance of long-read sequencing technologies to phase dense haplotypes at high accuracy thus makes it infeasible for large-scale applications. Using an individual’s genomic data of single haploid gamete cells provides a promising approach to overcome limitations in other methods for chromosome-length haplotype phasing.
Hapi
a novel easy-to-use and high-efficient algorithm
that only requires 3 to 5 gametes to reconstruct
accurate and high-resolution haplotypes of an individual. The gamete
genotype data may be generated from various platforms including
genotyping arrays and next generation sequencing even with
low-coverage. Hapi
simply takes genotype data of
known hetSNPs in single gamete cells as input and report the
high-resolution haplotypes as well as the confidence level of each
phased hetSNPs. The package also includes a module allowing
downstream analyses and visualization of identified
crossovers in the gametes.
Hapi
package installationHapi
can be easily installed from Github by running the
following command in R:
### Install dependencies ahead
install.packages('devtools')
install.packages('HMM')
devtools::install_github('Jialab-UCR/Hapi')
If the installation fails with the ERROR: object ‘enexprs’ is not
exported by ‘namespace:rlang’, please install the developmental version
of rlang
package first.
Hapi
can also be downloaded and installed locally. The
download link is here.
The core algorithm of
Hapi
to phase a chromosome consists of three main steps:
Haplotype phasing can be completed by running a single function
hapiAutoPhase()
with well-setting parameters.
## chr pos ref alt gmt1 gmt2 gmt3 gmt4 gmt5
## 960055 1 960055 T C <NA> <NA> C <NA> <NA>
## 1078815 1 1078815 C T C <NA> <NA> C <NA>
## 1079001 1 1079001 T A T <NA> <NA> T T
## 1079015 1 1079015 G A G <NA> <NA> G G
## 1079040 1 1079040 C T <NA> T T C <NA>
## 1079112 1 1079112 A G <NA> <NA> G A <NA>
### automatic haplotype phasing
hapOutput <- hapiAutoPhase(gmt = gmt, code = 'atcg')
head(hapOutput)
## chr pos ref alt gmt1 gmt2 gmt3 gmt4 gmt5 hap1 hap2 total rate
## 960055 1 960055 T C <NA> <NA> C <NA> <NA> T C 1 1
## 1078815 1 1078815 C T C <NA> <NA> C <NA> C T 2 1
## 1079001 1 1079001 T A T <NA> <NA> T T T A 3 1
## 1079015 1 1079015 G A G <NA> <NA> G G G A 3 1
## 1079040 1 1079040 C T <NA> T T C <NA> C T 3 1
## 1079112 1 1079112 A G <NA> <NA> G A <NA> A G 2 1
## confidence
## 960055 L
## 1078815 L
## 1079001 H
## 1079015 H
## 1079040 H
## 1079112 L
To take full advantage of matrix manipulation in R language,
A/T/C/G coded genotypes will be first converted to
0/1 coded format with the base2num()
function by providing reference and alternative alleles.
### covert A/T/C/G to 0/1
hetDa <- gmt[,1:4]
ref <- hetDa$ref
alt <- hetDa$alt
gmtDa <- gmt[,-(1:4)]
gmtDa <- base2num(gmt = gmtDa, ref = ref, alt = alt)
head(gmtDa)
## gmt1 gmt2 gmt3 gmt4 gmt5
## 960055 NA NA 1 NA NA
## 1078815 0 NA NA 0 NA
## 1079001 0 NA NA 0 0
## 1079015 0 NA NA 0 0
## 1079040 NA 1 1 0 NA
## 1079112 NA NA 1 0 NA
hetSNPs with potential genotyping errors can be detected and filtered
out by a Hidden MarKov Model (HMM) in the hapiErrorFitler()
function. The transition and emission probabilities can be defined by
users according to genotyping error rate and recombination
frequency.
### define HMM probabilities
hmm = initHMM(States=c("S","D"), Symbols=c("s","d"),
transProbs=matrix(c(0.99999,0.00001,0.00001,0.99999),2),
emissionProbs=matrix(c(0.99,0.01,0.01,0.99),2),
startProbs = c(0.5,0.5))
hmm
## $States
## [1] "S" "D"
##
## $Symbols
## [1] "s" "d"
##
## $startProbs
## S D
## 0.5 0.5
##
## $transProbs
## to
## from S D
## S 0.99999 0.00001
## D 0.00001 0.99999
##
## $emissionProbs
## symbols
## states s d
## S 0.99 0.01
## D 0.01 0.99
## 35 hetSNPs with potential genotyping errors are filtered out !
For low coverage sequencing, hetSNPs that are genotyped in at least 3 gametes can be selected to form a ‘precursor’ framework for the draft haplotype construction. For dataset with high genotyping rate, hetSNPs that are observed in more gametes may be selected.
## Number of hetSNPs in the framework: 21074
Missing genotypes in each gamete are iteratively imputed by observed
genotypes in other gametes to faciliate the draft haplotype inference.
In the hapiImpute()
function, nSPT
specifies
the minumum number of supporting gametes to impute a hetSNP. By default,
nSPT=2
, which means a hetSNP in a gamete can be imputed
only if imputations are supported by more than 2 gametes and no
imputation conflict from different supporting gametes.
allowNA
is the maximum number of gametes with NA
values at a locus that are allowed after imputation. This parameter is
useful only when the missing genotype rate is extremely high (eg.,
>80%) thus few markers can be completely imputed acroass all gamete
cells. By default, allowNA=0
. The markers, usually of a
small number, with missing data that cannot be fully resolved by
imputation are eliminated from the framework.
## Number of hetSNPs after imputation: 20994
## gmt1 gmt2 gmt3 gmt4 gmt5
## 1135959 0 1 1 0 0
## 1136439 0 1 1 0 0
## 1136733 0 1 1 0 0
## 1137140 0 1 1 0 0
## 1137167 0 1 1 0 0
## 1137267 0 1 1 0 0
With the imputed genotype data, draft haplotypes can be constructed
based upon majority voting strategy implemented in the
hapiPhase()
function. The majority voting strategy is used
to infer draft haplotypes by counting the number of links between two
adjacent markers. Because recombination frequency is very low, the
appearance of links that represents true haplotypes (hap-links)
should be much more frequent than those generated by crossovers
(cv-links). Draft haplotypes can be determined through voting
for the major links in sliding windows of 2 hetSNPs along the whole
chromosome.
## hap haplink cvlink tlink ratio
## 1135959 0 5 0 5 0
## 1136439 0 5 0 5 0
## 1136733 0 5 0 5 0
## 1137140 0 5 0 5 0
## 1137167 0 5 0 5 0
## 1137267 0 5 0 5 0
## hap haplink cvlink tlink ratio
## 2665203 0 4 1 5 0.25
## 4685329 0 4 1 5 0.25
## 11730915 0 4 1 5 0.25
## 29414096 0 4 1 5 0.25
## 29465722 0 4 1 5 0.25
## 35178660 0 4 1 5 0.25
## 35322900 0 4 1 5 0.25
## 133530288 0 4 1 5 0.25
## 141518185 0 4 1 5 0.25
## 144823941 0 4 1 5 0.25
The first step of proofreading is to filter out small regions (eg. < 1M) with multiple cv-links. Erroneous/ambiguous genotypes in complex genomic regions will largely affect the accuracy of imputation and thus impact the determination of hap-links by majority voting. If multiple cv-links are identified in a very small region, hetSNPs in this region will be eliminated from the frame and new draft haplotypes will be inferred again.
### identification of clusters of cv-links
cvCluster <- hapiCVCluster(draftHap = draftHap, cvlink = 2)
cvCluster
## [1] left right
## <0 rows> (or 0-length row.names)
### determine hetSNPs in small regions involving multiple cv-links
filter <- c()
for (i in 1:nrow(cvCluster)) {
filter <- c(filter, which (rownames(draftHap) >= cvCluster$left[i] &
rownames(draftHap) <= cvCluster$right[i]))
}
length(filter)
## [1] 0
### filter out hetSNPs in complex regions and infer new draft haplotypes
if (length(filter) > 0) {
imputedFrame <- imputedFrame[-filter, ]
draftHap <- hapiPhase(imputedFrame)
}
In the second step of proofreading, a Maximum Parsimony of
Recombination (MPR) strategy is adopted to proofread draft haplotypes at
genomic positions where there are potential incorrect cv-links.
By MPR, we believe that the most likely haplotype should be the one
generating less number of crossovers across all gamete cells. The
cvlink
parameter is used to determine the positions for
dividing draft haplotypes into blocks for proofreading. With 5 or more
gametes, positions that harboring more than 2 cv-links across
all the gametes will be used by default. With 3 or 4 gametes, probably
all the positions with cv-links should be proofread.
## Size of blocks: 132,119,600,3957,53,1340,75,13412,555,264,487
## 2 blocks are removed !
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 0/5
## Number of crossovers given haplotype 1/2: 0/5
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## Number of crossovers given haplotype 1/2: 1/4
## 1135959 1136439 1136733 1137140 1137167 1137267
## 0 0 0 0 0 0
Each gamete chromosome is compared to the draft haplotypes to deduce
gamete-specific haplotypes, with the non-framework markers being phased.
Consensus high-resolution haplotypes are eventually determined with
these gamete-specific haplotypes by the hapiAssembl()
function.
## Good consistence between draft haplotype andgmt1: 1
## Good consistence between draft haplotype andgmt2: 1
## Good consistence between draft haplotype andgmt3: 1
## Good consistence between draft haplotype andgmt4: 1
## Good consistence between draft haplotype andgmt5: 1
## hap1 hap2 total rate confidence
## 960055 0 1 1 1 L
## 1078815 0 1 2 1 H
## 1079001 0 1 3 1 H
## 1079015 0 1 3 1 H
## 1079040 0 1 3 1 H
## 1079112 0 1 2 1 H
When a crossover occurs at the end of a gamete chromosome where
hetSNPs are not enclosed in the framework, it becomes very challenging
to correctly infer the haplotypes for this region. Hapi
employs an additional capping strategy to polish two ends of chromosomal
haplotypes in the hapiAssembleEnd()
function.
consensusHap <- hapiAssembleEnd(gmt = gmtDa, draftHap = finalDraft,
consensusHap = consensusHap, k = 300)
## Number of reference pollens: 5
##
## Number of reference pollens: 5
### Haplotype 1
hap1 <- sum(consensusHap$hap1==0)
### Haplotype 2
hap2 <- sum(consensusHap$hap1==1)
### Number of unphased hetSNPs
hap7 <- sum(consensusHap$hap1==7)
### Accuracy
max(hap1, hap2)/sum(hap1, hap2)
## [1] 0.9999259
After inference of the high-resolution haplotypes, genotypes are then converted back to A/T/C/G coded format.
### find hetSNP overlaps
snp <- which(rownames(hetDa) %in% rownames(consensusHap))
ref <- hetDa$ref[snp]
alt <- hetDa$alt[snp]
### convert back to A/T/C/G
consensusHap <- num2base(hap = consensusHap, ref = ref, alt = alt)
head(consensusHap)
## hap1 hap2 total rate confidence
## 960055 T C 1 1 L
## 1078815 C T 2 1 L
## 1079001 T A 3 1 H
## 1079015 G A 3 1 H
## 1079040 C T 3 1 H
## 1079112 A G 2 1 L
## chr pos ref alt gmt1 gmt2 gmt3 gmt4 gmt5 hap1 hap2 total rate
## 960055 1 960055 T C <NA> <NA> C <NA> <NA> T C 1 1
## 1078815 1 1078815 C T C <NA> <NA> C <NA> C T 2 1
## 1079001 1 1079001 T A T <NA> <NA> T T T A 3 1
## 1079015 1 1079015 G A G <NA> <NA> G G G A 3 1
## 1079040 1 1079040 C T <NA> T T C <NA> C T 3 1
## 1079112 1 1079112 A G <NA> <NA> G A <NA> A G 2 1
## confidence
## 960055 L
## 1078815 L
## 1079001 H
## 1079015 H
## 1079040 H
## 1079112 L
The visualization function hapiGameteView()
is provided
here to view haplotypes and crossovers in a single gamete cell.
## chr pos hap
## 1 1 3787130 NA
## 2 1 3793818 NA
## 3 1 3798306 0
## 4 1 3823395 NA
## 5 1 4253891 0
## 6 1 4259253 0
## length cenStart cenEnd
## chr1 249250621 121535434 124535434
## chr2 243199373 92326171 95326171
## chr3 198022430 90504854 93504854
## chr4 191154276 49660117 52660117
## chr5 180915260 46405641 49405641
## chr6 171115067 58830166 61830166
Mapping recombinations at the individual level is a very important
by-product of gamete-based phasing. The crossover analysis
module in Hapi
provides an ideal tool for
investigating recombination events in single gamete cells, which has the
potential to be applied in clinical labs to manage human diseases that
are associated with abnormal recombination. In addition, it can also be
used to monitor the crossovers on plant genomes to facilitate more rapid
introgression of target genes or to break up undesirable linkages for
crop improvement.
With the high-resolution chromosome-length haplotypes, crossovers in
each gamete can be easily identified by the
hapiIdentifyCV()
function which also adopts a HMM.
## hap1 hap2
## 960055 T C
## 1078815 C T
## 1079001 T A
## 1079015 G A
## 1079040 C T
## 1079112 A G
## gmt1 gmt2 gmt3 gmt4 gmt5
## 960055 <NA> <NA> C <NA> <NA>
## 1078815 C <NA> <NA> C <NA>
## 1079001 T <NA> <NA> T T
## 1079015 G <NA> <NA> G G
## 1079040 <NA> T T C <NA>
## 1079112 <NA> <NA> G A <NA>
## gmt start end pos res
## 1 gmt1 144704855 144823941 144764398 119085
## 2 gmt2 4670029 4685329 4677679 15299
## 3 gmt2 11605193 11730555 11667874 125361
## 4 gmt2 141005431 141518185 141261808 512753
## 5 gmt4 29385454 29414096 29399775 28641
## 6 gmt4 29459037 29465722 29462379 6684
## 7 gmt4 35153363 35178660 35166011 25296
## 8 gmt4 35236017 35322900 35279458 86882
## 9 gmt5 2488318 2665203 2576760 176884
## 10 gmt5 133333150 133441645 133387397 108494
Resolution of recombination events, distances between adjacent crossovers, and the distribution of recombinations along each chromosome can be visualized by functions provided in the crossover analysis module.
hapiCVResolution()
function can generate a histogram of
resolution of crossovers.
Distribution of distances between adjacent crossovers on the same
chromosome can be visualized by the hapiCVDistance()
function,and may also be used to study coexisting crossovers as well as
the crossover interference phenomenon.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] HMM_1.0.1 Hapi_0.0.3
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.4 knitr_1.49 rlang_1.1.5
## [5] xfun_0.51 jsonlite_1.9.0 glue_1.8.0 colorspace_2.1-1
## [9] buildtools_1.0.0 htmltools_0.5.8.1 maketools_1.3.2 sys_3.4.3
## [13] sass_0.4.9 scales_1.3.0 rmarkdown_2.29 grid_4.4.2
## [17] tibble_3.2.1 evaluate_1.0.3 munsell_0.5.1 jquerylib_0.1.4
## [21] fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.2
## [25] pkgconfig_2.0.3 digest_0.6.37 R6_2.6.1 pillar_1.10.1
## [29] magrittr_2.0.3 bslib_0.9.0 tools_4.4.2 gtable_0.3.6
## [33] ggplot2_3.5.1 cachem_1.1.0