Impute the resolution of an HLA typing from low to high (second field)

upscale_typings() takes a low-resolution HLA typing string (or a character vector of them) and "upscales" each typing to two-field (high) resolution, based on haplotype frequencies from the NMDP registry.

Usage

upscale_typings(
  filepath,
  typings,
  loci_input = c("A", "B", "DRB1", "DRB.", "DQB1"),
  loci_output = c("A", "B", "C", "DRB1", "DRB.", "DQB1"),
  population = "EURCAU",
  n_haplos = NULL,
  n_genos = 1,
  as_list = FALSE
)

Arguments

filepath: String with the path to the HLA-A~C~B~DRB1~DQB1.xlsx file as downloaded from https://frequency.nmdp.org.
typings: String or character vector of HLA typings, with space-separated alleles in serological notation.
loci_input: Character vector of loci in the input genotype to be used for upscaling. Must be a subset of c("A", "B", "C", "DRB1", "DRB.", "DQB1"), where DRB. is DRB3/4/5. If one of these loci does not occur in the input typing, it will be ignored.
loci_output: Character vector of loci that the upscaled, output genotype should contain. Can be different from loci_input, for example if there's no typing at all for a certain locus and you want to infer it from the haplotypes. Must be a subset of c("A", "B", "C", "DRB1", "DRB.", "DQB1"), where DRB. is DRB3/4/5.
population: String specifying which population group to take the haplotype frequencies from. Must correspond to one of the abbreviations listed on https://haplostats.org. Defaults to "EURCAU" (the largest population in the dataset).
n_haplos: Number of most frequent haplotypes to use for the upscaling, e.g. 5000 to consider only the 5000 haplotypes with the highest frequency (the rest are discarded). Defaults to all haplotypes with a non-zero frequency in the selected population.
n_genos: Number of output genotypes to return for each input genotype, sorted by probability (frequency) of the output genotypes. Defaults to the most likely genotype only (i.e., 1).
as_list: Boolean (TRUE or FALSE) that determines whether to return the result as a single data frame, or as a list of data frames with one element per input typing (the latter can be useful when the input typings also live in a data frame; see examples).

Value

A data frame (or a list of data frames, if as_list = TRUE) with the upscaling results, containing the following columns:

id_input_typing: Sequential identifier for the input typing
id_unphased_geno: Unique identifier for each output unphased genotype
unphased_geno: Upscaled genotype
unphased_freq: Frequency of upscaled genotype
unphased_prob: Probability of upscaled genotype
phased_freq: Frequency of the phased genotype (= pair of haplotypes) that makes up the unphased genotype (can be many-to-one)
phased_prob: Probability of the phased genotype
haplo_freq_1: Frequency of the first haplotype in the phased genotype
haplo_rank_1: Rank (descending) of the frequency of this haplotype
haplo_freq_2: Frequency of the second haplotype in the phased genotype
haplo_rank_2: Rank (descending) of the frequency of this haplotype
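
To illustrate how these columns relate, here is a small sketch on a hand-made data frame. It assumes that the result holds one row per phased genotype, with the unphased columns repeated on every row belonging to the same unphased genotype. That layout is inferred from the column descriptions above rather than confirmed, and all values are invented.

```r
# Hypothetical result for a single input typing, as it might look with
# n_genos = 2. Values are made up; only a subset of the columns is shown.
res <- data.frame(
  id_input_typing  = c(1, 1, 1),
  id_unphased_geno = c(1, 1, 2),
  unphased_geno    = c(
    "A*02:01 A*24:02 B*35:01 B*40:02",
    "A*02:01 A*24:02 B*35:01 B*40:02",
    "A*02:05 A*24:02 B*35:01 B*40:02"
  ),
  unphased_prob    = c(0.9, 0.9, 0.1),
  phased_prob      = c(0.65, 0.25, 0.1),
  stringsAsFactors = FALSE
)

# Keep one row per unphased genotype (drop the phased-level detail)
keys <- res[c("id_input_typing", "id_unphased_geno")]
unphased_only <- res[
  !duplicated(keys),
  c("id_input_typing", "id_unphased_geno", "unphased_geno", "unphased_prob")
]

# Count how many phased genotypes map onto each unphased genotype
n_phased <- table(res$id_unphased_geno)
```

In this toy table, two phased genotypes map onto the first unphased genotype, which is the many-to-one relationship mentioned for phased_freq.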

Details

This function uses haplotype frequencies published by the NMDP at https://frequency.nmdp.org. You'll need to log in and accept the license to obtain the data, which is why we cannot distribute it with this package.

Imputation algorithm

Roughly, the function performs the following steps:

Translate all haplotype alleles from high-resolution to their serological equivalents
Select compatible haplotypes, i.e. those haplotypes with alleles that all occur in the input genotype
Combine all compatible haplotypes into phased genotypes (i.e. unique haplotype pairs). Fully homozygous genotypes are not considered
Calculate the frequency and probability of the phased genotypes
Combine phased genotypes into unique unphased genotypes, and calculate their frequency and probability
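
The steps above can be sketched in base R on a toy haplotype table. This is a minimal illustration, not the function's actual implementation. The table, its column names and all frequencies are made up. Step 1 is taken as already done (the `*_sero` columns). The sketch further assumes that a haplotype pair must jointly cover every allele in the input typing, that a phased genotype's frequency is `2 * f1 * f2` (Hardy-Weinberg proportions for two distinct haplotypes), and that probabilities are frequencies normalised over all compatible genotypes.

```r
# Toy haplotype table (made-up frequencies, NOT NMDP data)
haplos <- data.frame(
  haplo  = c("h1", "h2", "h3", "h4", "h5", "h6"),
  A      = c("A*02:01", "A*24:02", "A*02:01", "A*24:02", "A*02:05", "A*03:01"),
  B      = c("B*35:01", "B*40:02", "B*40:02", "B*35:01", "B*35:01", "B*35:01"),
  A_sero = c("A2", "A24", "A2", "A24", "A2", "A3"),
  B_sero = c("B35", "B61", "B61", "B35", "B35", "B35"),
  freq   = c(0.04, 0.02, 0.01, 0.03, 0.005, 0.05),
  stringsAsFactors = FALSE
)
input <- c("A2", "A24", "B35", "B61")

# Step 2: keep haplotypes whose alleles all occur in the input typing
compatible <- haplos[haplos$A_sero %in% input & haplos$B_sero %in% input, ]

# Step 3: all unique pairs of distinct haplotypes (no identical pairs)
idx <- t(combn(nrow(compatible), 2))
i <- idx[, 1]
j <- idx[, 2]

# Assumption: a pair only counts if it explains every input allele
covers_input <- mapply(function(a, b) {
  all(input %in% unlist(compatible[c(a, b), c("A_sero", "B_sero")]))
}, i, j)

# Helper: order the two alleles of a locus so that phase is dropped
sorted_pair <- function(x, y) {
  mapply(function(a, b) paste(sort(c(a, b)), collapse = " "),
         x, y, USE.NAMES = FALSE)
}

# Step 4: frequency and probability of each phased genotype
phased <- data.frame(
  haplo_1       = compatible$haplo[i],
  haplo_2       = compatible$haplo[j],
  unphased_geno = paste(sorted_pair(compatible$A[i], compatible$A[j]),
                        sorted_pair(compatible$B[i], compatible$B[j])),
  phased_freq   = 2 * compatible$freq[i] * compatible$freq[j],
  stringsAsFactors = FALSE
)[covers_input, ]
phased$phased_prob <- phased$phased_freq / sum(phased$phased_freq)

# Step 5: collapse phased genotypes into unique unphased genotypes
unphased <- aggregate(phased_freq ~ unphased_geno, data = phased, FUN = sum)
names(unphased)[2] <- "unphased_freq"
unphased$unphased_prob <- unphased$unphased_freq / sum(unphased$unphased_freq)
unphased <- unphased[order(-unphased$unphased_prob), ]
```

With these invented numbers, three haplotype pairs are compatible with the input. Two of them (h1 + h2 and h3 + h4) carry the same alleles in a different phase, so they collapse into a single unphased genotype with a probability of about 0.92; the remaining pair (h2 + h5) accounts for the other 0.08.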

For a detailed explanation of the terminology and imputation algorithm, see the following references:

Geffard et al., Easy-HLA: a validated web application suite to reveal the full details of HLA typing, Bioinformatics, Volume 36, Issue 7, April 2020, Pages 2157-2164, https://doi.org/10.1093/bioinformatics/btz875
Madbouly, A., Gragert, L., Freeman, J., Leahy, N., Gourraud, P.-A., Hollenbach, J.A., Kamoun, M., Fernandez-Vina, M. and Maiers, M. (2014), Validation of statistical imputation of allele-level multilocus phased genotypes from ambiguous HLA assignments. Tissue Antigens, 84: 285-292. https://doi.org/10.1111/tan.12390

Examples

if (FALSE) { # \dontrun{
upscale_typings(
  filepath = "~/Downloads/A~C~B~DRB3-4-5~DRB1~DQB1.xlsx",
  typings = "A24 A28 B35 B61 DR4 DR11"
)

# If you've got more than one typing to upscale, perhaps along with some
# ID'ing columns (e.g. patient ID), it's probably best to put them in a
# data frame and call `upscale_typings()` on your data frame:
library(tidyverse)

typing_df <- tibble(
  id = c("001", "002"),
  input_typings = c(
    "A24 A28 B35 B61 DR4 DR11",
    "A2 A3 B52 B35 Cw4 DR11 DR52 DQ3"
  )
)
typing_df |>
  mutate(geno_df = upscale_typings(
    "~/Downloads/A~C~B~DRB3-4-5~DRB1~DQB1.xlsx",
    input_typings,
    as_list = TRUE
  )) |>
  unnest(geno_df)
} # }