Impute the resolution of an HLA typing from low to high (second field)
Source: R/upscaling.R
upscale_typings() takes a (vector of) low-resolution HLA typing string(s) and "upscales" it to high (two-field) resolution, based on haplotype frequencies from the NMDP registry.
Arguments
- filepath
String with the path to the HLA-A~C~B~DRB1~DQB1.xlsx file as downloaded from https://frequency.nmdp.org.
- typings
String or character vector of HLA typings, with space-separated alleles in serological notation.
- loci_input
Character vector of loci in the input genotype to be used for upscaling. Must be a subset of c("A", "B", "C", "DRB1", "DRB.", "DQB1"), where DRB. is DRB3/4/5. If one of these loci does not occur in the input typing, it will be ignored.
- loci_output
Character vector of loci that the upscaled, output genotype should contain. Can be different from loci_input, for example if there's no typing at all for a certain locus and you want to infer it from the haplotypes. Must be a subset of c("A", "B", "C", "DRB1", "DRB.", "DQB1"), where DRB. is DRB3/4/5.
- population
String specifying which population group to use the haplotype frequencies from. Must correspond to one of the abbreviations that can be found on https://haplostats.org. Defaults to "EURCAU" (the largest population in the dataset).
- n_haplos
Number of most frequent haplotypes to use for the upscaling, e.g. 5000 to consider only the 5000 haplotypes with the highest frequency (the rest are discarded). Defaults to all haplotypes with a non-zero frequency in the selected population.
- n_genos
Number of output genotypes to return for each input genotype, sorted by probability (frequency) of the output genotypes. Defaults to the most likely genotype only (i.e., 1).
- as_list
Boolean (TRUE or FALSE) that determines whether to return the result as a single dataframe, or as a list of dataframes: one for each input typing (the latter can be useful when the input typings also live in a data frame; see examples).
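To illustrate how population and n_haplos interact, here is a minimal sketch in base R. It is not the package's implementation: the table layout, the POP2 column, the haplotype labels, and the helper select_haplos() are all hypothetical, and the frequencies are made up.

```r
# Hypothetical frequency table: one frequency column per population.
# Labels and numbers are invented for illustration only.
freqs <- data.frame(
  haplotype = c("h1", "h2", "h3", "h4", "h5"),
  EURCAU = c(0.05, 0.00, 0.20, 0.01, 0.10),
  POP2 = c(0.02, 0.03, 0.00, 0.15, 0.04),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick one population, drop zero-frequency haplotypes,
# rank by descending frequency, and optionally keep only the top n_haplos.
select_haplos <- function(freqs, population = "EURCAU", n_haplos = NULL) {
  out <- data.frame(
    haplotype = freqs$haplotype,
    freq = freqs[[population]],
    stringsAsFactors = FALSE
  )
  out <- out[out$freq > 0, ]       # default: all non-zero haplotypes
  out <- out[order(-out$freq), ]   # most frequent first
  out$rank <- seq_len(nrow(out))
  if (!is.null(n_haplos)) out <- head(out, n_haplos)
  out
}

select_haplos(freqs, "EURCAU", n_haplos = 2)$haplotype
# "h3" "h5"
```

A smaller n_haplos shrinks the search space, at the cost of possibly missing rare haplotypes that are compatible with the input typing.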
Value
A (list of) dataframe(s) with the upscaling results, containing the following columns:
- id_input_typing: Sequential identifier for the input typing
- id_unphased_geno: Unique identifier for each output unphased genotype
- unphased_geno: Upscaled genotype
- unphased_freq: Frequency of the upscaled genotype
- unphased_prob: Probability of the upscaled genotype
- phased_freq: Frequency of the phased genotype (= pair of haplotypes) that makes up the unphased genotype (can be many-to-one)
- phased_prob: Probability of the phased genotype
- haplo_freq_1: Frequency of the first haplotype in the phased genotype
- haplo_rank_1: Rank (descending) of the frequency of the first haplotype
- haplo_freq_2: Frequency of the second haplotype in the phased genotype
- haplo_rank_2: Rank (descending) of the frequency of the second haplotype
Details
This function uses haplotype frequencies published by the NMDP at https://frequency.nmdp.org. You'll need to log in and accept the license to obtain the data, which is why we cannot distribute it with this package.
Imputation algorithm
Roughly, the function performs the following steps:
- Translate all haplotype alleles from high resolution to their serological equivalents.
- Select compatible haplotypes, i.e. those haplotypes with alleles that all occur in the input genotype.
- Combine all compatible haplotypes into phased genotypes (i.e. unique haplotype pairs). Fully homozygous haplotypes are not considered.
- Calculate the frequency and probability of the phased genotypes.
- Combine phased genotypes into unique unphased genotypes, and calculate their frequency and probability.
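The steps above can be sketched on a toy example in base R. This is a rough illustration, not the package's actual code. It makes several assumptions: only two loci are used, the haplotype frequencies are made up, the frequency of a heterozygous phased genotype is taken as 2 * f1 * f2 (Hardy-Weinberg), and probabilities are frequencies normalised over all genotypes consistent with the input.

```r
# Toy haplotype table (frequencies are invented for illustration)
haplos <- data.frame(
  A = c("A*01:01", "A*02:01", "A*02:05", "A*01:01", "A*02:01", "A*03:01"),
  B = c("B*08:01", "B*07:02", "B*07:02", "B*07:02", "B*08:01", "B*07:02"),
  freq = c(0.06, 0.03, 0.01, 0.02, 0.01, 0.04),
  stringsAsFactors = FALSE
)

# Step 1: translate high-resolution alleles to serological equivalents
sero_lookup <- c(
  "A*01:01" = "A1", "A*02:01" = "A2", "A*02:05" = "A2", "A*03:01" = "A3",
  "B*07:02" = "B7", "B*08:01" = "B8"
)
haplos$A_sero <- unname(sero_lookup[haplos$A])
haplos$B_sero <- unname(sero_lookup[haplos$B])

input <- strsplit("A1 A2 B7 B8", " ")[[1]]

# Step 2: keep haplotypes whose alleles all occur in the input typing
compatible <- haplos[haplos$A_sero %in% input & haplos$B_sero %in% input, ]

# Step 3: form unique pairs of distinct haplotypes, and keep the pairs
# that together account for every allele in the input typing
pairs <- t(combn(nrow(compatible), 2))
explains_input <- apply(pairs, 1, function(idx) {
  setequal(unlist(compatible[idx, c("A_sero", "B_sero")]), input)
})
pairs <- pairs[explains_input, , drop = FALSE]
h1 <- compatible[pairs[, 1], ]
h2 <- compatible[pairs[, 2], ]

# Step 4: frequency (assumed 2 * f1 * f2) and probability of phased genotypes
phased <- data.frame(
  unphased_geno = paste(
    pmin(h1$A, h2$A), pmax(h1$A, h2$A),
    pmin(h1$B, h2$B), pmax(h1$B, h2$B)
  ),
  phased_freq = 2 * h1$freq * h2$freq,
  stringsAsFactors = FALSE
)
phased$phased_prob <- phased$phased_freq / sum(phased$phased_freq)

# Step 5: collapse phased genotypes into unique unphased genotypes
unphased <- aggregate(phased_freq ~ unphased_geno, data = phased, FUN = sum)
names(unphased)[2] <- "unphased_freq"
unphased$unphased_prob <- unphased$unphased_freq / sum(unphased$unphased_freq)
unphased <- unphased[order(-unphased$unphased_prob), ]
unphased
```

In this toy data, three phased genotypes are compatible with the input, and two of them carry the same alleles in a different phase. They therefore collapse into a single unphased genotype, which is the many-to-one relation between phased and unphased genotypes mentioned under Value.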
For a detailed explanation of the terminology and imputation algorithm, see the following references:
Geffard et al., Easy-HLA: a validated web application suite to reveal the full details of HLA typing, Bioinformatics, Volume 36, Issue 7, April 2020, Pages 2157-2164, https://doi.org/10.1093/bioinformatics/btz875
Madbouly, A., Gragert, L., Freeman, J., Leahy, N., Gourraud, P.-A., Hollenbach, J.A., Kamoun, M., Fernandez-Vina, M. and Maiers, M. (2014), Validation of statistical imputation of allele-level multilocus phased genotypes from ambiguous HLA assignments. Tissue Antigens, 84: 285-292. https://doi.org/10.1111/tan.12390
Examples
if (FALSE) { # \dontrun{
  upscale_typings(
    filepath = "~/Downloads/A~C~B~DRB3-4-5~DRB1~DQB1.xlsx",
    typings = "A24 A28 B35 B61 DR4 DR11"
  )

  # If you've got more than one typing to upscale, perhaps along with some
  # ID'ing columns (e.g. patient ID), it's probably best to put them in a
  # data frame and call `upscale_typings()` on your data frame:
  library(tidyverse)

  typing_df <- tibble(
    id = c("001", "002"),
    input_typings = c(
      "A24 A28 B35 B61 DR4 DR11",
      "A2 A3 B52 B35 Cw4 DR11 DR52 DQ3"
    )
  )

  # as_list = TRUE returns one data frame per input typing, so the result
  # can be stored as a list column and then unnested
  typing_df |>
    mutate(geno_df = upscale_typings(
      "~/Downloads/A~C~B~DRB3-4-5~DRB1~DQB1.xlsx",
      input_typings,
      as_list = TRUE
    )) |>
    unnest(geno_df)
} # }