upscale_typings() takes a low-resolution HLA typing string (or a vector of them) and "upscales" it to two-field high resolution, based on haplotype frequencies from the NMDP registry.

Usage

upscale_typings(
  filepath,
  typings,
  loci_input = c("A", "B", "DRB1", "DRB.", "DQB1"),
  loci_output = c("A", "B", "C", "DRB1", "DRB.", "DQB1"),
  population = "EURCAU",
  n_haplos = NULL,
  n_genos = 1,
  as_list = FALSE
)

Arguments

filepath

String with the path to the HLA-A~C~B~DRB1~DQB1.xlsx file as downloaded from https://frequency.nmdp.org.

typings

String or character vector of HLA typings, with space-separated alleles in serological notation.

loci_input

Character vector of loci in the input genotype to be used for upscaling. Must be a subset of c("A", "B", "C", "DRB1", "DRB.", "DQB1"), where DRB. is DRB3/4/5. If one of these loci does not occur in the input typing, it will be ignored.

loci_output

Character vector of loci that the upscaled, output genotype should contain. Can be different from loci_input, for example if there's no typing at all for a certain locus and you want to infer it from the haplotypes. Must be a subset of c("A", "B", "C", "DRB1", "DRB.", "DQB1"), where DRB. is DRB3/4/5.

population

String specifying which population group to take the haplotype frequencies from. Must correspond to one of the abbreviations listed on https://haplostats.org. Defaults to "EURCAU" (the largest population in the dataset).

n_haplos

Number of most frequent haplotypes to use for the upscaling, e.g. 5000 to consider only the 5000 haplotypes with the highest frequency (the rest are discarded). Defaults to all haplotypes with a non-zero frequency in the selected population.

n_genos

Number of output genotypes to return for each input genotype, sorted by probability (frequency) of the output genotypes. Defaults to the most likely genotype only (i.e., 1).

as_list

Boolean (TRUE or FALSE) that determines whether to return the result as a single dataframe, or a list of dataframes: one for each input typing (the latter can be useful when the input typings also live in a data frame; see examples).

Value

A (list of) dataframe(s) with the upscaling results, containing the following columns:

  1. id_input_typing: Sequential identifier for the input typing

  2. id_unphased_geno: Unique identifier for each output unphased genotype

  3. unphased_geno: Upscaled genotype

  4. unphased_freq: Frequency of upscaled genotype

  5. unphased_prob: Probability of upscaled genotype

  6. phased_freq: Frequency of the phased genotype (= pair of haplotypes) that makes up the unphased genotype (can be many-to-one)

  7. phased_prob: Probability of the phased genotype

  8. haplo_freq_1: Frequency of the first haplotype in the phased genotype

  9. haplo_rank_1: Rank (descending) of the frequency of this haplotype

  10. haplo_freq_2: Frequency of the 2nd haplotype in the phased genotype

  11. haplo_rank_2: Rank (descending) of the frequency of this haplotype
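
Because several phased genotypes can map to the same unphased genotype, the unphased columns may repeat across rows. The following is a minimal sketch of collapsing such a result to one row per unphased genotype and then keeping the most probable one per input typing. The data frame `res` is a made-up stand-in that only borrows the column names listed above; its values and placeholder genotype labels are assumptions, not real output of upscale_typings().

```r
# Toy stand-in for an upscaling result (values are invented for illustration)
res <- data.frame(
  id_input_typing  = c(1, 1, 1),
  id_unphased_geno = c(1, 1, 2),
  unphased_geno    = c("geno_a", "geno_a", "geno_b"), # placeholder labels
  unphased_prob    = c(0.875, 0.875, 0.125),
  phased_prob      = c(0.75, 0.125, 0.125),
  stringsAsFactors = FALSE
)

# Drop the phased-level columns and keep one row per unphased genotype
unphased_cols <- c(
  "id_input_typing", "id_unphased_geno", "unphased_geno", "unphased_prob"
)
unphased_only <- unique(res[, unphased_cols])

# Sort by input typing, then by descending probability,
# and keep the first (most probable) genotype for each input typing
ordered <- unphased_only[
  order(unphased_only$id_input_typing, -unphased_only$unphased_prob),
]
top <- ordered[!duplicated(ordered$id_input_typing), ]
```

With the default n_genos = 1 this last step should not be needed, since only the most likely genotype is returned per input typing.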

Details

This function uses haplotype frequencies published by the NMDP at https://frequency.nmdp.org. You'll need to login and accept the license to obtain the data, which is why we cannot distribute it with this package.

Imputation algorithm

Roughly, the function performs the following steps:

  1. Translate all haplotype alleles from high-resolution to their serological equivalents

  2. Select compatible haplotypes, i.e. those haplotypes with alleles that all occur in the input genotype

  3. Combine all compatible haplotypes into phased genotypes (i.e. unique haplotype pairs). Fully homozygous haplotypes are not considered

  4. Calculate the frequency and probability of the phased genotypes

  5. Combine phased genotypes into unique unphased genotypes, and calculate their frequency and probability
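
The steps above can be sketched in base R on a toy example. This is an illustration under stated assumptions, not the package's actual implementation: the haplotype frequencies are made up, only two loci are used, `sero_lookup` is a hypothetical hand-written allele-to-serology table, a haplotype pair is only kept if it accounts for every allele in the input (an assumption), and the phased genotype frequency is taken to be 2 * f1 * f2 (a Hardy-Weinberg assumption for pairs of distinct haplotypes).

```r
# Toy haplotype table: made-up frequencies, loci A and B only
haplos <- data.frame(
  haplo = c("A*24:02~B*35:01", "A*02:01~B*40:02", "A*24:02~B*40:02",
            "A*02:01~B*35:01", "A*01:01~B*08:01", "A*02:05~B*40:02"),
  freq = c(0.04, 0.03, 0.01, 0.02, 0.06, 0.005),
  stringsAsFactors = FALSE
)
# Hypothetical lookup from two-field alleles to serological equivalents
sero_lookup <- c(
  "A*24:02" = "A24", "A*02:01" = "A2", "A*02:05" = "A2", "A*01:01" = "A1",
  "B*35:01" = "B35", "B*40:02" = "B61", "B*08:01" = "B8"
)
input <- strsplit("A2 A24 B35 B61", " ", fixed = TRUE)[[1]]

# Step 1: translate haplotype alleles to their serological equivalents
alleles <- strsplit(haplos$haplo, "~", fixed = TRUE)
sero <- lapply(alleles, function(a) unname(sero_lookup[a]))

# Step 2: keep haplotypes whose alleles all occur in the input typing
compatible <- which(vapply(sero, function(s) all(s %in% input), logical(1)))

# Step 3: pair up distinct compatible haplotypes; keep pairs that together
# account for every allele in the input (assumption, see lead-in)
pairs <- t(combn(compatible, 2))
explains <- apply(pairs, 1, function(p) setequal(unlist(sero[p]), input))
pairs <- pairs[explains, , drop = FALSE]

# Step 4: frequency and probability of each phased genotype
phased <- data.frame(
  haplo_1 = haplos$haplo[pairs[, 1]],
  haplo_2 = haplos$haplo[pairs[, 2]],
  phased_freq = 2 * haplos$freq[pairs[, 1]] * haplos$freq[pairs[, 2]],
  stringsAsFactors = FALSE
)
phased$phased_prob <- phased$phased_freq / sum(phased$phased_freq)

# Step 5: collapse phased genotypes into unique unphased genotypes
phased$unphased_geno <- apply(pairs, 1, function(p) {
  paste(sort(unlist(alleles[p])), collapse = " ")
})
unphased <- aggregate(phased_freq ~ unphased_geno, data = phased, FUN = sum)
names(unphased)[2] <- "unphased_freq"
unphased$unphased_prob <- unphased$unphased_freq / sum(unphased$unphased_freq)
unphased <- unphased[order(-unphased$unphased_prob), ]
```

In this toy data, two different haplotype pairs produce the same unphased genotype, which illustrates why the relation between phased and unphased genotypes can be many-to-one.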

For a detailed explanation of the terminology and imputation algorithm, see the following references:

  • Geffard et al., Easy-HLA: a validated web application suite to reveal the full details of HLA typing, Bioinformatics, Volume 36, Issue 7, April 2020, Pages 2157-2164, https://doi.org/10.1093/bioinformatics/btz875

  • Madbouly, A., Gragert, L., Freeman, J., Leahy, N., Gourraud, P.-A., Hollenbach, J.A., Kamoun, M., Fernandez-Vina, M. and Maiers, M. (2014), Validation of statistical imputation of allele-level multilocus phased genotypes from ambiguous HLA assignments. Tissue Antigens, 84: 285-292. https://doi.org/10.1111/tan.12390

Examples

if (FALSE) { # \dontrun{
upscale_typings(
  filepath = "~/Downloads/A~C~B~DRB3-4-5~DRB1~DQB1.xlsx",
  typings = "A24 A28 B35 B61 DR4 DR11"
)

# If you've got more than one typing to upscale, perhaps along with some
# ID'ing columns (e.g. patient ID), it's probably best to put them in a
# data frame and call `upscale_typings()` on your data frame:
library(tidyverse)

typing_df <- tibble(
  id = c("001", "002"),
  input_typings = c(
    "A24 A28 B35 B61 DR4 DR11",
    "A2 A3 B52 B35 Cw4 DR11 DR52 DQ3"
  )
)
typing_df |>
  mutate(geno_df = upscale_typings(
    "~/Downloads/A~C~B~DRB3-4-5~DRB1~DQB1.xlsx",
    input_typings,
    as_list = TRUE
  )) |>
  unnest(geno_df)
} # }