Genomics Reporting Implementation Guide
3.0.1-SNAPSHOT - Ballot

Genomics Reporting Implementation Guide, published by HL7 International / Clinical Genomics. This guide is not an authorized publication; it is the continuous build for version 3.0.1-SNAPSHOT built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of and changes regularly. See the Directory of published versions

Appendix E: External Coding Systems

Genomics covers a broad landscape involving many organizations outside of HL7 that have established codes and name systems for different use cases. The HL7 Terminology Authority (HTA) governs the use of terminologies that are external to HL7 and has begun to document them. As a part of the Terminology Services Management Group, the HTA is working with these organizations to establish canonical URIs that can be used as HL7 code and name systems. HL7 has established HL7 Terminology (THO) as the authoritative source for code system information, including external terminologies. Many of these are also listed in the core specification guidance.

We have listed many, but not all, of the genomics-related systems below. In this table, we provide links for more information: to the THO or HTA entries when available, or to the organization home page if not. We also list a URI that has been approved by the THO/HTA or by the Vocabulary Work Group, and a coding system label defined in the HL7 Version 2.5.1 Implementation Guide: Laboratory Results Interface (LRI).

For those systems whose URI is listed as "not assigned yet," check the THO or HTA sites to see if it has been added since the publication of this IG. If not, we recommend following the External Code System Owner Engagement Processing. As a last resort, we suggest using the organization site URL, changing any "https" to "http" and removing trailing slashes.
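The last-resort rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fallback_system_uri` and the example URL are hypothetical, and "change any https to http" is read here as applying to the URL scheme.

```python
def fallback_system_uri(site_url: str) -> str:
    """Derive a last-resort code system URI from an organization's site URL.

    Hypothetical helper: it only applies the two steps named in the text
    (https -> http, strip trailing slashes) and nothing else.
    """
    uri = site_url.strip()
    # Step 1: change an "https" scheme to "http".
    if uri.lower().startswith("https://"):
        uri = "http://" + uri[len("https://"):]
    # Step 2: remove any trailing slashes.
    return uri.rstrip("/")


# Example with a placeholder organization URL (not a real code system URI).
print(fallback_system_uri("https://www.example.org/"))
```

A URI derived this way is a stopgap; an entry in THO or HTA, once available, should take precedence.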

Please contact the Clinical Genomics Workgroup co-chairs if you have any questions about using these genomics code systems.

Coding System Name | Description | Additional information | URI | LRI Code
Database of privately and publicly funded clinical studies conducted around the world.
ClinVar Variant ID
ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences and reported according to the HGVS standard. ClinVar includes simple variants and complex variants composed of multiple small variants. It now also includes large structural variants that have a known clinical implication, so simple, complex and many structural variants can all be found in ClinVar. ClinVar records have a field for Allele ID and a field for Variant ID; we focus on the Variant ID in this guide. This coding system uses the Variant ID as the code and the variant name from NCBI’s "variant_summary.txt.gz" file as the code’s print string. The "variant_summary.txt.gz" file carries more than 20 useful fields, including the separate components of the variant name, the cytogenetic location, the genomic reference, etc. Based on the Variant ID, you can therefore use ClinVar to find most of what you would want to know about the variant. In the LHC Clinical Table Search Service and LHC-Forms, we have indexed many of these attributes to assist users and applications that need to find the ID for a particular variant. CLINVAR-V
(Wellcome Trust Sanger Institute)
This table includes only simple somatic (cancer) mutations, one per unique mutation ID. The code is the COSMIC mutation ID, and the name is constructed from Ensembl transcript reference sequences and p.HGVS expressions that use the single-letter codes for amino acids. It carries fields analogous to most of the key fields in ClinVar, but its reference sequences are Ensembl transcript reference sequences with prefixes of ENST, it specifies amino acid changes with the older HGVS single-letter codes, and it carries examples of primary cancers and primary tissues, fields that are not in ClinVar. COSMIC's source table includes multiple records per mutation, one per submission. The COSMIC-SimpleVariants table that we have extracted from the original file includes only one record per unique mutation, a total of more than 3 million records. These contents are copyright COSMIC. LHC has produced a lookup table for these records, and a means for users to look up particular mutation IDs, both with permission from COSMIC. However, interested parties must contact COSMIC directly for permission to download these records. COSMIC-Smpl, COSMIC-Stru
Cytogenetic (chromosome) location
Chromosome location (also known as chromosome locus or cytogenetic location) is the standardized syntax for recording the position of genes and large variants. It consists of three parts: the chromosome number (e.g. 1-22, X, Y); an indicator of the arm, either “p” for the short arm or “q” for the long arm; and then generally a series of numbers separated by dots that indicate the region and any applicable band, sub-band, and sub-sub-band of the locus (e.g. 2p16.3). There are other conventions for reporting ranges and locations at the ends of the chromosomes. The table of these chromosome locations was loaded with all of the locations found in NCBI’s ClinVar variation tables and will expand as additional sources become available. It does not include every fine-grained chromosome location that exists; users can add to it as needed. (not assigned yet) Chrom-Loc
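As a rough illustration of the three-part syntax described in this entry, the following Python sketch splits a simple location such as 2p16.3 into chromosome, arm, and band digits. The names `ChromLoc` and `parse_chrom_loc` are hypothetical, and the sketch deliberately does not handle the other conventions mentioned above (ranges and locations at the ends of chromosomes).

```python
import re
from typing import NamedTuple, Optional


class ChromLoc(NamedTuple):
    chromosome: str  # "1".."22", "X" or "Y"
    arm: str         # "p" (short arm) or "q" (long arm)
    band: str        # region/band/sub-band digits, e.g. "16.3"; may be empty


# Chromosome, then arm, then optional dot-separated digits.
_CHROM_LOC = re.compile(
    r"(?P<chrom>[1-9]|1\d|2[0-2]|X|Y)(?P<arm>[pq])(?P<band>\d+(?:\.\d+)*)?"
)


def parse_chrom_loc(text: str) -> Optional[ChromLoc]:
    """Return the parsed parts, or None if the text is not a simple location."""
    match = _CHROM_LOC.fullmatch(text.strip())
    if match is None:
        return None
    return ChromLoc(match["chrom"], match["arm"], match["band"] or "")


print(parse_chrom_loc("2p16.3"))
```

Returning `None` for anything outside the simple form keeps the sketch from silently misreading a range as a single locus.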
(DBSNP: Single Nucleotide Polymorphism database)
The Short Genetic Variations database (dbSNP) is a public-domain archive maintained by NCBI for a broad collection of short genetic polymorphisms. The SNP ID is unique for each position and length of DNA change. For example, a change of three nucleotides will have a different SNP ID than a change of four nucleotides at the same locus, but the code will be the same for all changes at the same locus and with the same length. So, to specify a variation in relation to a reference sequence, both the alt allele and the SNP code must be included. dbSNP
dbVar is NCBI's database of genomic structural variations (including copy number variants) that are larger than 50 contiguous base pairs. It is the complement of dbSNP, which identifies variants occurring in 50 or fewer contiguous base pairs. dbVar contains insertions, deletions, duplications, inversions, multi-nucleotide substitutions, mobile element insertions, translocations, and complex chromosomal rearrangements. dbVar carries structured Germline and Somatic variants in separate files. Accordingly, we have divided the coding system for dbVar the same way. This coding system represents the Germline dbVar variants. Its record ID may begin with one of four prefixes: nsv, nssv, esv and essv. These are accession prefixes for variant regions (nsv, esv) and variant calls, or instances (nssv, essv). Typically, one or more variant instances (nssv: variant calls based directly on experimental evidence) are merged into one variant region (nsv: a pair of start-stop coordinates reflecting the submitters’ assertion of the region of the genome that is affected by the variant instances). The “n” preceding sv or ssv indicates that the variants were submitted to NCBI (dbVar). The prefix “e” in esv and essv marks variant entities (corresponding to NCBI’s nsv and nssv) that were submitted to EBI (DGVa). The relation between variant calls (instances) and variant regions is many to one. The LHC lookup table for dbVar germline variations includes both the variant instance records (essv or nssv) and the variant region records (nsv, esv). Users can sub-select by searching on the appropriate prefix. (not assigned yet) dbVar-GL, dbVar-Som
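The four accession prefixes described in this entry lend themselves to a small classifier. The Python sketch below is illustrative only: `classify_dbvar_id` is a hypothetical name, the sample accessions are made up, and the assumption that an accession is a prefix followed only by digits is ours rather than something stated in this guide.

```python
import re
from typing import Optional, Tuple

# First letter: submitting archive; then "ssv" (call) or "sv" (region); then digits.
_DBVAR_ID = re.compile(r"(?P<source>[ne])(?P<kind>ssv|sv)(?P<number>\d+)")


def classify_dbvar_id(accession: str) -> Optional[Tuple[str, str]]:
    """Return (archive, record type) for a dbVar-style accession, else None."""
    match = _DBVAR_ID.fullmatch(accession.strip())
    if match is None:
        return None
    archive = "NCBI (dbVar)" if match["source"] == "n" else "EBI (DGVa)"
    record = "variant call" if match["kind"] == "ssv" else "variant region"
    return archive, record


# Made-up accessions, used only to exercise the prefixes.
for accession in ("nsv100", "nssv200", "esv300", "essv400"):
    print(accession, classify_dbvar_id(accession))
```

A lookup service could use the same split to sub-select regions or calls, as the entry suggests doing by prefix search.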
ENSEMBL genomic reference sequence
(European Bioinformatics Institute)
Set of Ensembl gene reference sequences whose identifiers have a prefix of "ENSG." It only includes genomic sequences associated with genes and uses the whole build plus the chromosome number to identify chromosome reference sequences, rather than a separate set of reference sequence identifiers as NCBI does. LHC has not yet produced a convenient look up table for these files. Ensembl-G
ENSEMBL protein reference sequence
(European Bioinformatics Institute)
Set of Ensembl protein reference sequences whose identifiers have a prefix of "ENSP" and correspond to NCBI's "NP_" reference sequence identifiers. LHC has not yet produced a convenient lookup table for these files. Ensembl-P
ENSEMBL transcript reference sequence
(European Bioinformatics Institute)
Set of Ensembl transcript reference sequences. Their identifiers are distinguished by the prefix of "ENST" and correspond to NCBI's "NM_" reference sequence identifiers. LHC has not yet produced a convenient lookup table for these files. Ensembl-T
Gene Code
When applicable, this variable identifies the gene on which the variant is located. The gene identifier is also carried in the transcript reference sequence database, and is part of a full HGVS expression. Not all genes have HGNC names and codes so NCBI has created gene IDs that cover the genes that are not registered by HGNC. (not assigned yet) NCBI-gene code
GL String
The GL String Code (GLSC) code system combines Genotype List String (GL String) grammar with established HLA and KIR nomenclatures, thereby specifying a syntax for encoding of nomenclature-level genotyping results. The established HLA and KIR nomenclatures are primarily concerned with identification and characterization of individual alleles, and do not incorporate a grammar for genotyping results. GL String grammar is a string format for genotyping results from genetic systems with defined nomenclatures, such as HLA and KIR. In addition to genotyping results, the grammar can also describe data analysis artifacts such as multi-locus (multi-gene) haplotypes. From a terminology standpoint, the result of adding a compositional grammar to an established code system such as HLA or KIR nomenclature is a new code system, because the grammar may be used to construct expressions not explicitly defined in the original system as concept codes. (not assigned yet)
(HGNC: HUGO Gene Nomenclature Committee)

The HGNC gene table carries the gene ID, gene symbol and full gene name. Gene IDs must be sent as codes and begin with "HGNC:". Gene symbols should be sent as the display. Gene IDs are specific to the species, whereas gene symbols and names are shared by all species with the same gene. The HGNC-Symb table provided by HUGO carries only human genes and is available in a table by LHC. The code for this coding system is the HGNC gene code; the "name" or print string is the HGNC gene symbol. More than 28,000 human gene symbols and names have been assigned so far, including almost all of the protein-coding genes. NCBI creates what might be thought of as interim codes. Older systems might send HGNC gene IDs without the "HGNC:" prefix. This does not follow the guidance of the creators of the code system and introduces ambiguity, so caution must be taken to confirm alignment.

HGNC also provides an index of gene families/groups. Gene group IDs do not begin with "HGNC:", so care must be taken to ensure alignment of concepts when viewing an HGNC ID from an older system, which may be referring to a gene ID and not a gene group. For example, 588 refers to the HLA gene family, but HGNC:588 identifies the ATG12 gene. (Gene) (Gene Family/Group)
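The prefix rule for HGNC gene IDs can be enforced before a code is placed in a message. The Python sketch below is a minimal illustration: the function names are hypothetical, the dictionary mirrors the system/code/display shape of a FHIR Coding, and the `system` value is passed in by the caller because this sketch does not assert which URI is approved.

```python
import re

# A gene ID is the literal prefix "HGNC:" followed by digits.
_HGNC_GENE_ID = re.compile(r"HGNC:\d+")


def is_hgnc_gene_id(code: str) -> bool:
    """True only for prefixed gene IDs such as "HGNC:588"."""
    return _HGNC_GENE_ID.fullmatch(code) is not None


def hgnc_gene_coding(gene_id: str, symbol: str, system: str) -> dict:
    """Build a Coding-like dict: gene ID as code, gene symbol as display."""
    if not is_hgnc_gene_id(gene_id):
        # A bare number is ambiguous: it could be a gene ID or a gene group ID.
        raise ValueError(f"ambiguous or malformed HGNC gene ID: {gene_id!r}")
    return {"system": system, "code": gene_id, "display": symbol}


# The system URI here is a placeholder, not an approved value.
print(hgnc_gene_coding("HGNC:588", "ATG12", "http://example.org/hgnc-gene"))
```

Rejecting bare numbers outright, rather than guessing, reflects the 588 versus HGNC:588 ambiguity described above; a receiving system would need out-of-band confirmation before repairing such a code.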

HGVS - Genomic syntax
(HGVS: Human Genome Variation Society)
HGVS syntax that describes the variations (mutations) at the genome level (the DNA before it is spliced to remove introns). The genomic syntax statements which can describe simple or structural variants are distinguished by a leading "g." HGVS.g, HGVS.p, HGVS.c
HLA Nomenclature
(European Bioinformatics Institute)
Human leukocyte antigen (HLA) is a gene family found in the Major Histocompatibility Complex (MHC) of Chromosome 6 in humans. This family includes more than 50 genes. A subset of these are commonly used as markers for histocompatibility testing for stem cell and solid organ transplantation, drug sensitivity, and disease association. The WHO Nomenclature Committee for Factors of the HLA System is responsible for a common nomenclature of HLA alleles, allele sequences, and quality control, to communicate histocompatibility typing information to match donors and recipients. An HLA allele is defined as any set of variations found on a sequence of DNA comprising an HLA gene. So, if there are five variations found in this one gene sequence, this set is defined as one allele (as opposed to the definition of an allele as the variation found between the test specimen and the reference along a contiguous stretch of DNA). In the case of HLA, the contiguous stretch of DNA represents the entire gene, and the variations do not need to be contiguous within the gene sequence. Each HLA allele has a unique name consisting of the gene name followed by up to four fields, each containing at least two digits, separated by colons. There are also optional suffixes added to indicate expression status. Individual HLA genes may have thousands of different alleles. For the full specification, see the HLA nomenclature website. HLA nomenclature can also be used to represent sets of alleles that share sequence identity in the Antigen Recognition Site (ARS). G-groups are alleles that have identical DNA sequences in the ARS, while P-groups are alleles that have identical protein sequences in the ARS. These are described, respectively, in and HLA-Allele
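The allele naming rule in this entry (gene name, up to four colon-separated fields of at least two digits, optional expression suffix) can be checked with a regular expression. The Python sketch below is an illustration, not a full implementation of the WHO nomenclature: `parse_hla_allele` is a hypothetical name, and the "HLA-" prefix, the "*" separator after the gene name, and the suffix letters N, L, S, C, A, Q come from the published nomenclature convention rather than from the text of this entry. G-group and P-group names are not handled.

```python
import re
from typing import List, Optional, Tuple

_HLA_ALLELE = re.compile(
    r"(?:HLA-)?"                            # optional locus prefix
    r"(?P<gene>[A-Z][A-Z0-9]*)\*"           # gene name, then "*"
    r"(?P<fields>\d{2,}(?::\d{2,}){0,3})"   # one to four fields, >= 2 digits each
    r"(?P<suffix>[NLSCAQ])?"                # optional expression-status suffix
)


def parse_hla_allele(name: str) -> Optional[Tuple[str, List[str], str]]:
    """Return (gene, fields, suffix), or None if the name does not fit the pattern."""
    match = _HLA_ALLELE.fullmatch(name.strip())
    if match is None:
        return None
    return match["gene"], match["fields"].split(":"), match["suffix"] or ""


print(parse_hla_allele("HLA-A*24:02:01:02L"))
```

The parser checks syntax only; whether a well-formed name denotes an allele that has actually been assigned is a question for the nomenclature database.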
Human Phenotype Ontology
(HPO - Human Phenotype Ontology)
The Human Phenotype Ontology (HPO) aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality, such as atrial septal defect. HPO
Mondo Disease Ontology
(MONDO - Mondo Disease Ontology)
The Mondo Disease Ontology is a semi-automatically constructed ontology that merges in multiple disease resources to yield a coherent merged ontology that contains cross-species disease terminology. Numerous sources for disease definitions and data models currently exist, which include HPO, OMIM, SNOMED CT, ICD, PhenoDB, MedDRA, MedGen, ORDO, DO, GARD, etc; however, these sources partially overlap and sometimes conflict, making it difficult to know definitively how they relate to each other. This has resulted in a proliferation of mappings between disease entries in different resources; however mappings are problematic: collectively, they are expensive to create and maintain. Most importantly, the mappings lack completeness, accuracy, and precision; as a result, mapping calls are often inconsistent between resources. The UMLS provides intermediate concepts through which other resources can be mapped, but these mappings suffer from the same challenges: they are not guaranteed to be one-to-one, especially in areas with evolving disease concepts such as rare disease. MONDO
(International System for Human Cytogenetic Nomenclature)
Like HGVS, the International System for Human Cytogenetic Nomenclature (ISCN) is a syntax. It came out of cytopathology and deals with reporting karyotypes down to the chromosome fusions and many types of small copy number variants. However, cytogenetics is out of scope for this guide. ISCN syntax is often used to report large deletion-duplications in structural variants and to include other variants that have been observed. ISCN
(Regenstrief Institute)
Logical Observation Identifiers Names and Codes (LOINC®) provides a set of universal codes and names for identifying laboratory and other clinical observations. One of the main goals of LOINC is to facilitate the exchange and pooling of results for clinical care, outcomes management, and research. LOINC was initiated by Regenstrief Institute research scientists who continue to develop it with the collaboration of the LOINC Committee. LN
LRG : Locus Reference Genomic Sequences
(LRG : Locus Reference Genomic)
LRG is a manually curated record that contains stable, and thus un-versioned, reference sequences designed specifically for reporting sequence variants with clinical implications. It provides a genomic DNA sequence representation of a single gene that is idealized, has a permanent ID (with no versioning), and has core content that never changes. The LRG database includes maps to NCBI, Ensembl and UCSC reference sequences. It contained sequences for a total of 1073 genes as of April 2016, with identifiers of the form "LRG_####", where #### can be from 1 to N, and N is the last gene processed. See PMIDs: 24285302, 20398331, and 20428090 for more information. LRG-RefSeq
PHARMGKB: Pharmacogenomic Knowledge Base
(PHARMGKB : Pharmacogenomic Knowledge Base )
PharmGKB is a comprehensive resource that curates knowledge about the impact of genetic variation on drug response for clinicians and researchers.
PubMed® comprises citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full text content from PubMed Central and publisher web sites.
Subset of NCBI Human RefSeqs with prefix of NC_ or NG_. Those prefixed with "NC_" represent the whole genomic RefSeq for individual chromosomes. Those prefixed with "NG_" represent genes with all of their introns and flanking regions and other larger or smaller genomic sequences. These are available separately in the NCBI source data file, which includes all human RefSeqs (including those with prefix of NR_ or XM_): RefSeq-G, RefSeq-P, RefSeq-T
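RefSeq accession prefixes can be mapped to the kind of sequence they identify. The Python sketch below is illustrative: `refseq_kind` is a hypothetical name, the NC_ and NG_ meanings come from this entry, the NM_ (transcript) and NP_ (protein) meanings follow the correspondences given in the Ensembl entries, and the assumed accession shape (prefix, digits, optional ".version") is ours. Other prefixes in the NCBI source file, such as NR_ and XM_, are deliberately left unclassified.

```python
import re
from typing import Optional

# Only the prefixes discussed in this appendix are mapped.
_REFSEQ_KINDS = {
    "NC_": "genomic (chromosome)",
    "NG_": "genomic (gene region)",
    "NM_": "transcript",
    "NP_": "protein",
}

_REFSEQ_ACCESSION = re.compile(r"(?P<prefix>[A-Z]{2}_)\d+(?:\.\d+)?")


def refseq_kind(accession: str) -> Optional[str]:
    """Return the sequence kind for a RefSeq-style accession, else None."""
    match = _REFSEQ_ACCESSION.fullmatch(accession.strip())
    if match is None:
        return None
    return _REFSEQ_KINDS.get(match["prefix"])


for accession in ("NC_000001.11", "NG_007726.3", "NM_000546.6", "NP_000537.3"):
    print(accession, refseq_kind(accession))
```

The accessions in the loop are used only to exercise each prefix.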
Sequence Ontology
(Sequence Ontology)
Collaborative ontology project for the definition of sequence features used in biological sequence annotation
Pharmacogenomic Star Alleles
(PharmVar: Pharmacogene Variation Consortium)
The star allele nomenclature is commonly used in pharmacogenomics as shorthand to specify one or more specific variants in a gene that is known to impact drug metabolism or response. A star allele can identify either a single variant or a group of variants found in cis, and therefore it usually represents a haplotype. A star allele name is composed of the gene symbol and an allele number, separated by an asterisk, e.g. TPMT*2. By convention, the *1 allele represents the allele that contains the "reference" sequence, although that is not true in all cases. Closely related alleles may be assigned a common number and be differentiated by a unique letter that specifies the suballele (e.g., TPMT*3A, TPMT*3B). Pharmacogenomics tests commonly report patient genotypes as diplotypes, e.g. *1/*3A. The star nomenclature system is inadequately defined and inconsistently adopted. Therefore, although the system we are proposing supports the inclusion of pharmacogenomics star alleles as a legacy syntax, we strongly encourage messages that include star alleles to rigorously define those alleles through other means. (not assigned yet) Star-Allele
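The basic star allele shape described above (gene symbol, asterisk, allele number, optional suballele letter) and the diplotype shorthand can be sketched as follows. This Python example is a simplification with hypothetical function names; because the nomenclature is inconsistently adopted, real reports may contain forms this sketch rejects, and a successful parse says nothing about which variants the allele actually comprises.

```python
import re
from typing import List, Optional, Tuple

# Gene symbol, "*", allele number, optional single suballele letter.
_STAR_ALLELE = re.compile(r"(?P<gene>[A-Z][A-Z0-9]*)\*(?P<number>\d+)(?P<sub>[A-Z]?)")


def parse_star_allele(name: str) -> Optional[Tuple[str, int, str]]:
    """Return (gene symbol, allele number, suballele letter), or None."""
    match = _STAR_ALLELE.fullmatch(name.strip())
    if match is None:
        return None
    return match["gene"], int(match["number"]), match["sub"]


def expand_diplotype(gene: str, diplotype: str) -> List[str]:
    """Expand shorthand such as "*1/*3A" into full star allele names."""
    names = [gene + part.strip() for part in diplotype.split("/")]
    for name in names:
        if parse_star_allele(name) is None:
            raise ValueError(f"not a simple star allele: {name!r}")
    return names


print(parse_star_allele("TPMT*3A"))
print(expand_diplotype("TPMT", "*1/*3A"))
```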