Genomics Reporting Implementation Guide
3.0.0 - release

Genomics Reporting Implementation Guide, published by HL7 International / Clinical Genomics. This guide is not an authorized publication; it is the continuous build for version 3.0.0 built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/HL7/genomics-reporting/ and changes regularly. See the Directory of published versions

Appendix E: External Coding Systems

Genomics covers a broad landscape involving many organizations outside of HL7 that have established codes and name systems for different use cases. The HL7 Terminology Authority (HTA) governs the use of terminologies that are external to HL7 and has begun to document them. As a part of the Terminology Services Management Group, the HTA is working with these organizations to establish canonical URIs that can be used as HL7 code and name systems. HL7 has established terminology.hl7.org (THO) as the authoritative source for code system information, including external terminologies. Many of these are also listed in the core specification guidance.

We have listed many, but not all, of the genomics-related systems below. In this table, we provide links for more information – to the THO or HTA entries when available, or to the organization home page if not. We also list a URI that has been approved by the THO/HTA or by the Vocabulary Work Group, and a coding system label defined in the HL7 Version 2.5.1 Implementation Guide: Laboratory Results Interface (LRI).

For those organizations whose URI is listed as “Not Assigned” or “Not Official,” check the THO or HTA sites to see if one has been added since the publication of this IG. If not, we recommend following the External Code System Owner Engagement Processing. As a last resort, we suggest using the organization site URL, changing any “https” to “http” and removing trailing slashes.
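The last-resort fallback described above can be sketched as a small helper. This is only an illustration: the function name `fallback_system_uri` is hypothetical, and the result is a placeholder, not an approved canonical URI.

```python
from urllib.parse import urlsplit, urlunsplit


def fallback_system_uri(site_url: str) -> str:
    """Derive a last-resort code system URI from an organization site URL.

    Applies the two steps named in the text: "https" becomes "http",
    and trailing slashes are removed. Hypothetical helper, not part of the IG.
    """
    parts = urlsplit(site_url.strip())
    scheme = "http" if parts.scheme == "https" else parts.scheme
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))


print(fallback_system_uri("https://glstring.org/"))
```

A URI produced this way should be replaced as soon as THO or the HTA publishes an official one.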

Please contact the Clinical Genomics Workgroup co-chairs if you have any questions about using these genomics code systems.

Coding System, Description, URI, LRI Code
ClinicalTrials.gov Database of privately and publicly funded clinical studies conducted around the world. URI: http://clinicaltrials.gov (Not Official)

LRI Code: None
ClinVar Variant ID ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences, and reported according to the HGVS standard. ClinVar includes simple and complex variants composed of multiple small variants. It now also includes large structural variants that have a known clinical implication, so simple, complex and many structural variants can all be found in ClinVar. The ClinVar records have a field for Allele ID and for Variant ID. We focus on the Variant ID in this guide. This coding system uses the Variant ID as the code and the variant name from NCBI’s “variant_summary.txt.gz” file as the code’s print string. The “variant_summary.txt.gz” file carries more than 20 useful fields, including the separate components of the variant name, the cytogenetic location, the genomic reference, etc. So based on the Variant ID, you can use ClinVar to find most of what you would ever want to know about the variant. In the LHC Clinical Table Search Service and LHC-Forms, we have indexed many of these attributes to assist users and applications that need to find the ID for a particular variant. URI: http://www.ncbi.nlm.nih.gov/clinvar

LRI Code: CLINVAR-V
COSMIC COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world’s largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. COSMIC includes only simple somatic (cancer) mutations, one per unique mutation ID. The code is the COSMIC mutation ID, and the name is constructed from Ensembl transcript reference sequences and p.HGVS expressions that use the single-letter codes for amino acids. It carries fields analogous to most of the key fields in ClinVar, but its reference sequences are Ensembl transcript reference sequences with prefixes of ENST; it specifies amino acid changes with the older HGVS single-letter codes; and it carries examples of primary cancers and primary tissues, fields that are not in ClinVar. COSMIC’s source table includes multiple records per mutation, one per submission. The COSMIC-SimpleVariants table that we have extracted from the original file includes only one record per unique mutation – a total of more than 3 million records. These contents are copyright COSMIC (http://cancer.sanger.ac.uk/cosmic/license). With permission from COSMIC, LHC has produced a lookup table for these records and made it available for users to look up particular mutation IDs. However, interested parties must contact COSMIC directly for permission to download these records. URI: http://cancer.sanger.ac.uk/cancergenome/projects/cosmic (Not Official)

LRI Code: COSMIC-Smpl, COSMIC-Stru
Cytogenetic (chromosome) location Chromosome location (AKA chromosome locus or cytogenetic location) is the standardized syntax for recording the position of genes and large variants. It consists of three parts: the chromosome number (e.g. 1-22, X, Y), an indicator of which arm – either “p” for the short or “q” for the long, and then generally a series of numbers separated by dots that indicate the region and any applicable band, sub-band, and sub-sub-band of the locus (e.g. 2p16.3). There are other conventions for reporting ranges and locations at the ends of the chromosomes. The table of these chromosome locations was loaded with all of the locations found in NCBI’s ClinVar variation tables. It will expand as additional sources become available. It does not include every fine-grained chromosome location that exists. Users can add to it as needed. URI: (Not Assigned)

LRI Code: Chrom-Loc
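As an illustration of the three-part syntax described in the cytogenetic location entry above, the following sketch splits a simple location such as 2p16.3 into chromosome, arm, and band. The names `CytoLocation` and `parse_cyto_location` are hypothetical, and the pattern deliberately ignores the other conventions the entry mentions (ranges and chromosome-end locations), returning `None` for them.

```python
import re
from typing import NamedTuple, Optional


class CytoLocation(NamedTuple):
    chromosome: str      # "1" to "22", "X" or "Y"
    arm: str             # "p" (short arm) or "q" (long arm)
    band: Optional[str]  # dotted digits after the arm, e.g. "16.3"


# Assumption: only single, simple locations are handled.
_CYTO = re.compile(
    r"^(?P<chromosome>[1-9]|1[0-9]|2[0-2]|X|Y)"
    r"(?P<arm>[pq])"
    r"(?P<band>\d+(?:\.\d+)*)?$"
)


def parse_cyto_location(text: str) -> Optional[CytoLocation]:
    """Return the parts of a simple cytogenetic location, or None if unsupported."""
    match = _CYTO.match(text.strip())
    if match is None:
        return None
    return CytoLocation(match["chromosome"], match["arm"], match["band"])


print(parse_cyto_location("2p16.3"))
```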
dbSNP The Short Genetic Variations database (dbSNP) is a public-domain archive maintained by NCBI for a broad collection of short genetic polymorphisms. The SNP ID is unique for each position and length of DNA change. For example, a change of three nucleotides will have a different SNP ID than a change of four nucleotides at the same locus, but the code will be the same for all changes at the same locus and with the same length. So, to specify a variation in relation to a reference sequence, both the alt allele and the SNP code must be included. URI: http://www.ncbi.nlm.nih.gov/projects/SNP

LRI Code: dbSNP
dbVar dbVar is NCBI’s database of genomic structural variations (including copy number variants) that are larger than 50 contiguous base pairs. It is the complement of dbSNP, which identifies variants occurring in 50 or fewer contiguous base pairs. dbVar contains insertions, deletions, duplications, inversions, multi-nucleotide substitutions, mobile element insertions, translocations, and complex chromosomal rearrangements. dbVar carries structured Germline and Somatic variants in separate files. Accordingly, we have divided the coding system for dbVar the same way. This coding system represents the Germline dbVar variants. Its record ID may begin with one of four prefixes: nsv, nssv, esv and essv. These are accession prefixes for variant regions (nsv, esv) and variant calls (or instances; nssv, essv). Typically, one or more variant instances (nssv – variant calls based directly on experimental evidence) are merged into one variant region (nsv – a pair of start-stop coordinates reflecting the submitters’ assertion of the region of the genome that is affected by the variant instances). The “n” preceding sv or ssv indicates that the variants were submitted to NCBI (dbVar). The prefix “e” in esv and essv marks variant entities (corresponding to NCBI’s nsv and nssv) that were submitted to EBI (DGVa). The relation between variant call instances and variant regions is many to one. The LHC lookup table for dbVar germline variations includes both the variant instance records (essv or nssv) and the variant region records (nsv, esv). Users can sub-select by searching on the appropriate prefix. URI: (Not Assigned)

LRI Code: dbVar-GL, dbVar-Som
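The prefix rules in the dbVar entry above can be sketched as a small classifier that reports where an accession was submitted and whether it is a variant region or a variant call. The function name `describe_dbvar_accession` is hypothetical, and the accession numbers in the usage line are made up for illustration.

```python
import re
from typing import Optional, Tuple

# First letter: submitting archive. "ssv" = variant call (instance), "sv" = variant region.
_DBVAR = re.compile(r"^(?P<archive>[ne])(?P<kind>ssv|sv)(?P<number>\d+)$")

_ARCHIVES = {"n": "NCBI (dbVar)", "e": "EBI (DGVa)"}
_KINDS = {"sv": "variant region", "ssv": "variant call"}


def describe_dbvar_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (archive, record kind) for a dbVar-style accession, else None."""
    match = _DBVAR.match(accession.strip())
    if match is None:
        return None
    return _ARCHIVES[match["archive"]], _KINDS[match["kind"]]


# Illustrative accession only; the number is not taken from dbVar.
print(describe_dbvar_accession("nssv12345"))
```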
NCBI Gene Codes Gene supplies gene-specific connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. When applicable, this variable identifies the gene on which the variant is located. The gene identifier is also carried in the transcript reference sequence database, and is part of a full HGVS expression. Not all genes have HGNC names and codes, so NCBI has created gene IDs that cover the genes that are not registered by HGNC. For example, SHOX (HGNC:10853) is in the pseudoautosomal region of ChrX and ChrY. NCBI gene symbols are ‘SHOX’ on ChrX and ‘SHOX-2’ on ChrY. URI: (Not Assigned)

LRI Code: NCBI-gene
GL String The GL String Code (GLSC) code system combines Genotype List String (GL String) grammar with established HLA and KIR nomenclatures, thereby specifying a syntax for encoding of nomenclature-level genotyping results. The established HLA and KIR nomenclatures are primarily concerned with identification and characterization of individual alleles, and do not incorporate a grammar for genotyping results. GL String grammar is a string format for genotyping results from genetic systems with defined nomenclatures, such as HLA and KIR. In addition to genotyping results, the grammar can also describe data analysis artifacts such as multi-locus (multi-gene) haplotypes. From a terminology standpoint, the result of adding a compositional grammar to an established code system such as HLA or KIR nomenclature is a new code system, because the grammar may be used to construct expressions not explicitly defined in the original system as concept codes. URI: http://glstring.org (Not Official)

LRI Code: None
HGNC The HUGO Gene Nomenclature Committee (HGNC) table carries the gene ID, gene symbol and full gene name. Gene IDs must be sent as codes and begin with “HGNC:”. Gene symbols should be sent as display. Gene IDs are specific to the species, whereas gene symbols and names are shared by all species with the same gene. The HGNC-Symb table provided by HUGO carries only human genes and is available in a table by LHC. The code for this coding system is the HGNC gene code; the “name” or print string is the HGNC gene symbol. More than 28,000 human gene symbols and names have been assigned so far, including almost all of the protein coding genes. NCBI creates what might be thought of as interim codes. Older systems might send HGNC gene IDs without the “HGNC:” prefix. This does not follow the guidance of the creators of the code system, and it introduces ambiguity, so caution must be taken to confirm alignment. HGNC also provides an index on gene families/groups. GeneGroup IDs do not begin with “HGNC:”, so care must be taken to ensure alignment of concepts when viewing an HGNC ID from an older system that may be referring to the gene ID and not a gene group. For example, 588 refers to the HLA gene family, but HGNC:588 identifies the ATG12 gene. URI: http://www.genenames.org (Gene)
http://www.genenames.org/genegroup (Group)

LRI Code: HGNC-Symb
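To make the “HGNC:” prefix rule from the HGNC entry above concrete, here is a minimal sketch that builds a code/display pair and rejects a bare number instead of guessing whether it is a gene ID or a gene group ID. The function name `hgnc_gene_coding` is hypothetical, the dictionary shape is an assumed simplification of a FHIR Coding, and the system URI is copied from the entry as listed.

```python
import re

# System URI as listed for genes in the HGNC entry of this table.
HGNC_GENE_SYSTEM = "http://www.genenames.org"

_HGNC_GENE_ID = re.compile(r"^HGNC:\d+$")


def hgnc_gene_coding(gene_id: str, symbol: str) -> dict:
    """Build a code/display pair: gene ID as the code, gene symbol as the display.

    Raises ValueError for IDs without the "HGNC:" prefix, because a bare
    number such as "588" could also be read as a gene group ID.
    """
    if not _HGNC_GENE_ID.match(gene_id):
        raise ValueError(f"HGNC gene IDs must look like 'HGNC:<number>': {gene_id!r}")
    return {"system": HGNC_GENE_SYSTEM, "code": gene_id, "display": symbol}


print(hgnc_gene_coding("HGNC:588", "ATG12"))
```

Rejecting the unprefixed form, rather than silently adding “HGNC:”, keeps the ambiguity visible to whoever maintains the sending system.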
HGVS The HGVS Nomenclature is an internationally-recognized standard for the description of DNA, RNA, and protein sequence variants. It is used to convey variants in clinical reports and to share variants in publications and databases. The format of the HGVS representation is “reference:description” where the reference is the reference sequence, and the description has the format of “prefix.position(s) change,” where the prefix indicates the type of reference sequence used, and the position(s) change describes the location of the change and the nucleotide(s) changed. URI: http://varnomen.hgvs.org

LRI Code: HGVS.g, HGVS.p, HGVS.c
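The “reference:description” layout in the HGVS entry above can be illustrated with a simple splitter. This is a sketch, not a validating HGVS parser: the names `HgvsParts` and `split_hgvs` are hypothetical, and the set of accepted prefixes is an assumption based on commonly used HGVS reference types.

```python
from typing import NamedTuple


class HgvsParts(NamedTuple):
    reference: str  # reference sequence, e.g. "NM_004006.2"
    prefix: str     # type of reference sequence, e.g. "c", "g" or "p"
    change: str     # position(s) and change, left unparsed here


# Assumption: single-letter prefixes in common use; not an exhaustive list.
_KNOWN_PREFIXES = {"g", "m", "c", "n", "r", "p"}


def split_hgvs(expression: str) -> HgvsParts:
    """Split 'reference:prefix.change' into its three parts."""
    reference, colon, description = expression.strip().partition(":")
    if not (reference and colon and description):
        raise ValueError(f"expected 'reference:description': {expression!r}")
    prefix, dot, change = description.partition(".")
    if not dot or prefix not in _KNOWN_PREFIXES or not change:
        raise ValueError(f"expected 'prefix.position(s) change': {description!r}")
    return HgvsParts(reference, prefix, change)


print(split_hgvs("NM_004006.2:c.4375C>T"))
```

The split happens on the colon first, so the dot inside a versioned reference sequence such as NM_004006.2 is never mistaken for the prefix separator.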
HLA Nomenclature Human leukocyte antigen (HLA) is a gene family found in the Major Histocompatibility Complex (MHC) of Chromosome 6 in humans. This family includes more than 50 genes. A subset of these are commonly used as markers for histocompatibility testing for stem cell and solid organ transplantation, drug sensitivity, and disease association. The WHO Nomenclature Committee for Factors of the HLA System is responsible for a common nomenclature of HLA alleles, allele sequences, and quality control, to communicate histocompatibility typing information to match donors and recipients. An HLA allele is defined as any set of variations found on a sequence of DNA comprising an HLA gene. So, if there are five variations found in this one gene sequence, this set is defined as one allele (vs. the definition of an allele being the variation found between the test specimen and the reference along a contiguous stretch of DNA). In the case of HLA, the contiguous stretch of DNA represents the entire gene, and the variations do not need to be contiguous within the gene sequence. Each HLA allele has a unique name consisting of the gene name followed by up to four fields, each containing at least two digits, separated by colons. There are also optional suffixes added to indicate expression status. Individual HLA genes may have thousands of different alleles. For the full specification, please go to this website: http://hla.alleles.org/nomenclature/naming.html. HLA nomenclature can also be used to represent sets of alleles that share sequence identity in the Antigen Recognition Site (ARS). G-groups are alleles that have identical DNA sequences in the ARS, while P-groups are alleles that have identical protein sequences in the ARS. These are described, respectively, in http://hla.alleles.org/alleles/g_groups.html and http://hla.alleles.org/alleles/p_groups.html. URI: http://www.ebi.ac.uk/ipd/imgt/hla (Not Official)

LRI Code: HLA-Allele
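The allele naming pattern in the HLA entry above (gene name, up to four colon-separated fields of at least two digits, optional expression suffix) can be sketched with a regular expression. Two details are assumptions drawn from the WHO naming page linked in the entry rather than from the text itself: the “*” separator after the gene name and the suffix letters N, L, S, C, A and Q. The names `HlaAllele` and `parse_hla_allele` are hypothetical, and G-group and P-group names are not handled.

```python
import re
from typing import NamedTuple, Optional, Tuple


class HlaAllele(NamedTuple):
    gene: str                # e.g. "A" or "DRB1"
    fields: Tuple[str, ...]  # one to four fields of two or more digits
    suffix: Optional[str]    # expression-status letter, if present


_HLA_NAME = re.compile(
    r"^(?:HLA-)?(?P<gene>[A-Z][A-Z0-9]*)\*"  # gene name, then "*" (assumed separator)
    r"(?P<fields>\d{2,}(?::\d{2,}){0,3})"     # up to four colon-separated fields
    r"(?P<suffix>[NLSCAQ])?$"                 # optional suffix (assumed letter set)
)


def parse_hla_allele(name: str) -> Optional[HlaAllele]:
    """Return the parts of an HLA allele name, or None if it does not fit the pattern."""
    match = _HLA_NAME.match(name.strip())
    if match is None:
        return None
    return HlaAllele(match["gene"], tuple(match["fields"].split(":")), match["suffix"])


print(parse_hla_allele("HLA-A*24:02:01:02L"))
```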
Human Phenotype Ontology The Human Phenotype Ontology (HPO) aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality, such as atrial septal defect. URI: http://human-phenotype-ontology.org

LRI Code: HPO
ISCN Like HGVS, the International System for Human Cytogenetic Nomenclature (ISCN) is a syntax. It came out of cytopathology and deals with reporting karyotypes down to the chromosome fusions and many types of small copy number variants. Cytogenetics itself is out of scope for this guide; however, ISCN syntax is often used to report large deletion-duplications in structural variants and to include other variants that have been observed. URI: https://iscn.karger.com

LRI Code: ISCN
Mondo Disease Ontology The Mondo Disease Ontology is a semi-automatically constructed ontology that merges in multiple disease resources to yield a coherent merged ontology that contains cross-species disease terminology.

Numerous sources for disease definitions and data models currently exist, which include HPO, OMIM, SNOMED CT, ICD, PhenoDB, MedDRA, MedGen, ORDO, DO, GARD, etc; however, these sources partially overlap and sometimes conflict, making it difficult to know definitively how they relate to each other. This has resulted in a proliferation of mappings between disease entries in different resources; however mappings are problematic: collectively, they are expensive to create and maintain. Most importantly, the mappings lack completeness, accuracy, and precision; as a result, mapping calls are often inconsistent between resources. The UMLS provides intermediate concepts through which other resources can be mapped, but these mappings suffer from the same challenges: they are not guaranteed to be one-to-one, especially in areas with evolving disease concepts such as rare disease.
URI: http://purl.obolibrary.org/obo/mondo.owl

LRI Code: MONDO
PHARMVAR The Pharmacogene Variation (PharmVar) Consortium is a central repository for pharmacogene (PGx) variation that focuses on haplotype structure and allelic variation. The major focus of PharmVar is to catalogue allelic variation of genes impacting drug metabolism, disposition and response and provide a unifying designation system (nomenclature) for the global pharmacogenetic/genomic community. Efforts are synchronized between PharmVar, the Pharmacogenomic KnowledgeBase, and the Clinical Pharmacogenetic Implementation Consortium. URI: http://www.pharmvar.org

LRI Code: Star-Allele
RefSeq The NCBI Reference Sequence Database (RefSeq) is a set of reference sequences including genomic, transcript, and protein. RefSeqs include a prefix to describe the type of reference, such as NC_ or NG_. Those prefixed with “NC_” represent the whole genomic RefSeq for individual chromosomes. Those prefixed with “NG_” represent genes with all of their introns and flanking regions and other larger or smaller genomic sequences. These are available separately in the NCBI source data file, which includes all human RefSeqs (including those with prefix of NR_ or XM_): see ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens. URI: http://www.ncbi.nlm.nih.gov/refseq

LRI Code: RefSeq-G, RefSeq-P, RefSeq-T
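As an illustration of the prefix convention in the RefSeq entry above, this sketch maps an accession prefix to one of the three sequence types the entry names. The NC_ and NG_ meanings come from the entry; the NM_, NR_, XM_ and NP_ rows are added from general RefSeq usage and should be treated as assumptions here. The function name `refseq_category` is hypothetical, and the table is not exhaustive.

```python
from typing import Optional

# Prefix -> sequence type. Only NC_ and NG_ are described in the entry above;
# the remaining rows are assumptions based on general RefSeq conventions.
_PREFIX_CATEGORY = {
    "NC_": "genomic",     # whole chromosome
    "NG_": "genomic",     # gene region with introns and flanking sequence
    "NM_": "transcript",  # assumed: protein-coding transcript
    "NR_": "transcript",  # assumed: non-coding transcript
    "XM_": "transcript",  # assumed: predicted transcript
    "NP_": "protein",     # assumed: protein
}


def refseq_category(accession: str) -> Optional[str]:
    """Return 'genomic', 'transcript' or 'protein', or None for unknown prefixes."""
    return _PREFIX_CATEGORY.get(accession.strip()[:3].upper())


print(refseq_category("NC_000001.11"))
```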
Sequence Ontology Collaborative ontology project for the definition of sequence features used in biological sequence annotation. URI: http://www.sequenceontology.org

LRI Code: None