VCF API Documentation

Exception Types

class FileReadError : public std::runtime_error

Exception thrown when there is a problem reading the underlying file.

class MalformedFile : public std::runtime_error

Exception thrown when the VCF file is not well formed according to the spec.

class ApiMisuse : public std::runtime_error

Exception thrown when the API is misused (bad arguments, using iterators incorrectly).
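Because all three types derive from std::runtime_error, callers can catch the specific categories first and fall back to the base class. The following is a minimal sketch of that pattern: the classes in the `sketch` namespace are stand-ins that only mirror the declarations above, and `classifyFailure` is a hypothetical helper, not part of the library.

```cpp
#include <stdexcept>
#include <string>

// Stand-in declarations mirroring the documented hierarchy. In real code the
// types come from the library header and are not redefined like this.
namespace sketch {

class FileReadError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

class MalformedFile : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

class ApiMisuse : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: run `action` and report which category of failure
// (if any) it raised. Specific handlers come before the base-class handler.
template <typename Action>
std::string classifyFailure(Action action) {
    try {
        action();
        return "ok";
    } catch (const FileReadError&) {
        return "io";
    } catch (const MalformedFile&) {
        return "malformed";
    } catch (const ApiMisuse&) {
        return "misuse";
    } catch (const std::runtime_error&) {
        return "other";
    }
}

} // namespace sketch
```

Ordering the handlers this way lets a caller treat an unreadable file differently from a file that is readable but violates the spec.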

Classes and Methods for VCF Parsing

class VCFFile

A lazy parser for VCF files.

Loads the data into memory a line at a time. Does not parse the entire line, as users often only need certain pieces of information.

Public Functions

inline bool isUsingIndex() const

Is this VCF file using an index for fast random access?

Returns:

true if an index is being used.

inline bool lowerBoundPosition(size_t position)

Seek to the first variant greater than or equal to the given position. Slow if using a non-indexed VCF(.GZ), fast if you have a tabix index.

The Tabix index uses only the linear index, not the binning index. The latter is more complex, and we’re not interested in ranges so much as we are interested in a single position.
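The "first variant greater than or equal to the given position" rule is the usual lower-bound semantics. The sketch below shows only that rule, applied to a sorted in-memory list of positions with std::lower_bound. It is not how the library seeks through a file or a tabix index, and `firstAtOrAfter` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the index of the first position >= target in a sorted list, or
// sortedPositions.size() if every position is smaller than target.
std::size_t firstAtOrAfter(const std::vector<std::size_t>& sortedPositions,
                           std::size_t target) {
    auto it = std::lower_bound(sortedPositions.begin(), sortedPositions.end(), target);
    return static_cast<std::size_t>(it - sortedPositions.begin());
}
```

With duplicate positions, the lower bound lands on the first of the duplicates, so no variant at the requested position is skipped.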

inline bool getContig(const std::string &contig, size_t &length, std::string &assembly)

Get information for the contig with the given name.

Parameters:
  • contig[in] The contig name, as found in the VCF header.

  • length[out] Returns the length found in the VCF header, if the function return value was true.

  • assembly[out] Returns the assembly name found in the VCF header, if the function return value was true.

Returns:

true if the contig was found, false otherwise.

inline size_t numVariants()

Compute the number of variants in the file. This is expensive: it is not a constant-time operation, because it involves scanning the entire file. It is faster for indexed files, as only the index is scanned.

Returns:

The number of variants.

inline RangePair getGenomeRange()

Compute the range of positions for variants in the file. If the VCF metadata does not contain contigs, this is not a constant-time operation, because it involves scanning the entire file. Otherwise the sequence length in the contig header is used.

Returns:

A pair of the minimum and maximum variant positions present in the file.

inline size_t numIndividuals()

Get the number of individuals with labels in the VCF file.

Returns:

number of individuals.

inline std::vector<std::string> &getIndividualLabels()

Get a list of the labels for the individuals in the VCF file.

Returns:

vector of strings, where the 0th is the 0th individual's label, etc.

inline std::vector<std::string> getAllMetaInfo(const char *const key) const

Get all metadata values for a given key, from the VCF header rows.

Parameters:

key[in] The metadata key name.

Returns:

a vector of the strings associated with the given key, or an empty vector.

inline std::string getMetaInfo(const char *const key) const

Get a single metadata value for a given key, from the VCF header rows. Fails if there is more than one value for the key.

Parameters:

key[in] The metadata key name.

Throws:

MalformedFile – exception thrown if there is more than one value for the key.

Returns:

the string associated with the given key, or empty string.

inline FileOffset getFilePosition()

Get an opaque handle describing the current file position of the parser.

Returns:

Current FileOffset.

inline void setFilePosition(const FileOffset &position)

Use an opaque handle to return to a previously-recorded file position.

Parameters:

position[in] A FileOffset saved via getFilePosition().
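One use of the getFilePosition()/setFilePosition() pair is to look ahead and then rewind. The sketch below is written as a template so it relies only on the member names documented here. `peekNextPosition`, `StubVariant`, and `StubParser` are hypothetical names: the stub is an in-memory stand-in that exists only to make the example self-contained, and its FileOffset is a plain index, whereas the real handle is opaque. This document does not say whether the current variant view stays valid after the position is restored, so the sketch copies the position out before rewinding.

```cpp
#include <cstddef>
#include <vector>

// Look at the position of the next variant without consuming it.
// Returns false (and leaves positionOut untouched) if no variant remains.
template <typename Parser>
bool peekNextPosition(Parser& parser, std::size_t& positionOut) {
    const auto bookmark = parser.getFilePosition();  // remember where we are
    const bool found = parser.nextVariant();
    if (found) {
        positionOut = parser.currentVariant().getPosition();
    }
    parser.setFilePosition(bookmark);                // rewind to the bookmark
    return found;
}

// Hypothetical stand-ins used only to exercise the function above.
struct StubVariant {
    std::size_t position = 0;
    std::size_t getPosition() const { return position; }
};

struct StubParser {
    using FileOffset = std::size_t;  // the real handle is opaque
    std::vector<std::size_t> positions;
    FileOffset offset = 0;
    StubVariant current;

    FileOffset getFilePosition() const { return offset; }
    void setFilePosition(const FileOffset& saved) { offset = saved; }
    bool nextVariant() {
        if (offset >= positions.size()) return false;
        current.position = positions[offset++];
        return true;
    }
    StubVariant& currentVariant() { return current; }
};
```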

inline void seekBeforeVariants()

Change the parser position to be immediately before the first variant.

inline bool hasNextVariant () PICOVCF_DEPRECATED

DEPRECATED: just use nextVariant() which returns a boolean telling you whether there was another variant to retrieve.

Returns:

true if calling nextVariant() will place us at a valid variant.

inline bool nextVariant()

Read the variant at the current file position and move the file position to the following variant.

Returns:

true if there was another variant, false otherwise.

inline VCFVariantView &currentVariant()

Get a parseable view of the variant that we last encountered with nextVariant().

Returns:

A VCFVariantView that can be queried for variant information.
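Taken together, seekBeforeVariants(), nextVariant(), and currentVariant() suggest a simple traversal loop. The sketch below is a template that uses only those documented member names plus getPosition() on the returned view. `countVariantsInRange`, `FakeVariant`, and `FakeParser` are hypothetical, and the fake types are in-memory stand-ins rather than the library's classes.

```cpp
#include <cstddef>
#include <vector>

// Count variants whose position lies in the closed range [lo, hi].
template <typename Parser>
std::size_t countVariantsInRange(Parser& parser, std::size_t lo, std::size_t hi) {
    std::size_t count = 0;
    parser.seekBeforeVariants();          // start from the first variant
    while (parser.nextVariant()) {        // false once no variant remains
        const std::size_t pos = parser.currentVariant().getPosition();
        if (pos >= lo && pos <= hi) {
            ++count;
        }
    }
    return count;
}

// Hypothetical stand-ins used only to exercise the loop above.
struct FakeVariant {
    std::size_t position = 0;
    std::size_t getPosition() const { return position; }
};

struct FakeParser {
    std::vector<std::size_t> positions;
    std::size_t next = 0;
    FakeVariant current;

    void seekBeforeVariants() { next = 0; }
    bool nextVariant() {
        if (next >= positions.size()) return false;
        current.position = positions[next++];
        return true;
    }
    FakeVariant& currentVariant() { return current; }
};
```

The loop relies on the return value of nextVariant() alone, which matches the advice given for the deprecated hasNextVariant().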

class VCFVariantView

A class that lazily interprets the data for a single variant.

This does not “parse” the whole row associated with the variant, it locates the positions of required fields and then parses those fields on demand. The individual genotype data is accessed through another lazy view (iterator).
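The idea of locating fields and parsing them only on demand can be illustrated with a small standalone class. This is a sketch of the general technique, not the library's implementation: `LazyRowSketch` is a hypothetical name, it rescans the row on each call instead of caching field offsets, and it only relies on the VCF convention that CHROM and POS are the first two tab-separated columns.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical lazy view over one tab-separated VCF data row.
class LazyRowSketch {
public:
    explicit LazyRowSketch(std::string line) : line_(std::move(line)) {}

    // CHROM is column 0 and POS is column 1 of a VCF data row.
    std::string chrom() const { return field(0); }
    std::size_t position() const {
        return static_cast<std::size_t>(std::stoull(field(1)));
    }

private:
    // Walk the tab separators only as far as the requested column, then
    // copy out just that column. Later columns are never touched.
    std::string field(std::size_t index) const {
        std::size_t start = 0;
        for (std::size_t i = 0; i < index; ++i) {
            const std::size_t tab = line_.find('\t', start);
            if (tab == std::string::npos) {
                throw std::runtime_error("row has too few columns");
            }
            start = tab + 1;
        }
        const std::size_t end = line_.find('\t', start);
        const std::size_t length =
            (end == std::string::npos) ? std::string::npos : end - start;
        return line_.substr(start, length);
    }

    std::string line_;
};
```

Asking for the position never inspects the genotype columns, which is the saving a lazy view is after when rows carry many individuals.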

Public Functions

inline VCFVariantInfo parseToVariantInfo(bool basicInfoOnly = true) const

Parse and make a copy of the non-genotype data in this variant row. This can be expensive, especially without basicInfoOnly set, but does allow you to capture the information from this view and not lose it when you move to the next variant.

Parameters:

basicInfoOnly[in] True by default. When true, only the chromosome, position, id, ref allele, and alt allele fields are populated. Set to false to get the additional fields.

inline const std::vector<bool> &getIsPhased() const

Return a vector of size numIndividuals, where each value is true if that individual is phased for the current variant, and false otherwise.

inline VCFPhasedness getPhasedness() const

Return an enum indicating the overall phasedness of the variant. Can be PVCFP_PHASED, PVCFP_UNPHASED, or PVCFP_MIXED. Useful for not having to check the phasedness of every individual on every variant, if you expect a certain kind of data.

inline AlleleT getMaxPloidy() const

The maximum ploidy seen for any individual in this variant.

inline std::string getChrom() const

Get the chromosome identifier.

Returns:

String of the chromosome identifier.

inline size_t getPosition() const

Get the genome position.

Returns:

Integer of the genome position.

inline std::string getID() const

Get the ID for this variant.

Returns:

String of the ID.

inline std::string getRefAllele() const

Get the reference allele for this variant.

Returns:

String of the reference allele.

inline std::vector<std::string> getAltAlleles() const

Get the alternative alleles for this variant.

Returns:

Vector of strings for all the alternative alleles. The allele index associated with each individual can be used to lookup the actual allele in this vector.

inline double getQuality() const

Get the quality value.

Returns:

The numeric value for the quality.

inline std::string getFilter() const

Get the filter value.

Returns:

The string value for the filter.

inline std::unordered_map<std::string, std::string> getInfo() const

Get the info key/value pairs.

Returns:

The information as a map from key to value.

inline bool hasGenotypeData() const

Returns:

true if this VCF file contains genotype data.

inline std::vector<std::string> getFormat() const

Gets the list of formats.

Enforces that “GT” must be the first FORMAT. If the resulting vector is empty then there is no FORMAT and thus there is no genotype data.

Returns:

A vector of the format strings.

inline std::vector<AlleleT> getGenotypeArray()

Get an array (std::vector) of allele values per haploid, where the alleles for an individual are grouped together (consecutive) based on maxPloidy. The maxPloidy can be retrieved via getMaxPloidy(), and can differ for each Variant, but is the same within a variant. When an individual has missing data for an allele, the value is picovcf::MISSING_DATA, and when their ploidy is less than maxPloidy, the remaining alleles are filled in with picovcf::MIXED_PLOIDY.

Returns:

A std::vector of allele values. Size will always be maxPloidy * numIndividuals.
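Given the layout described above, the allele for copy c of individual i sits at index i * maxPloidy + c. The sketch below walks such a flat array and counts non-reference alleles per individual. Several things here are assumptions: `AlleleSketchT`, `kMissingData`, and `kMixedPloidy` are hypothetical stand-ins for AlleleT, picovcf::MISSING_DATA, and picovcf::MIXED_PLOIDY (whose real type and values are not given in this document), and allele value 0 is taken to mean the reference allele, following the VCF GT convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the library's allele type and sentinel values.
using AlleleSketchT = std::uint16_t;
constexpr AlleleSketchT kMissingData = 0xFFFF;
constexpr AlleleSketchT kMixedPloidy = 0xFFFE;

// Count non-reference alleles per individual in a flat genotype array that
// stores maxPloidy consecutive slots for each individual.
std::vector<std::size_t> altAlleleCounts(const std::vector<AlleleSketchT>& genotypes,
                                         std::size_t maxPloidy) {
    std::vector<std::size_t> counts;
    if (maxPloidy == 0) {
        return counts;
    }
    const std::size_t numIndividuals = genotypes.size() / maxPloidy;
    counts.assign(numIndividuals, 0);
    for (std::size_t i = 0; i < numIndividuals; ++i) {
        for (std::size_t c = 0; c < maxPloidy; ++c) {
            const AlleleSketchT allele = genotypes[i * maxPloidy + c];
            // Skip slots that carry no allele: missing data or padding for
            // individuals whose ploidy is below maxPloidy.
            if (allele == kMissingData || allele == kMixedPloidy) {
                continue;
            }
            if (allele != 0) {  // assumed: 0 denotes the reference allele
                ++counts[i];
            }
        }
    }
    return counts;
}
```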

inline IndividualIteratorGT getIndividualIterator () const PICOVCF_DEPRECATED

Get an iterator for traversing over the individual genotype data.

Returns:

An IndividualIteratorGT for efficiently accessing the genotype data.

Public Static Attributes

static constexpr const char *const FORMAT_GT = "GT"

Genotype format string identifier.

inline std::map<std::string, std::string> picovcf::picovcf_parse_structured_meta(const std::string &metaValue)

Parse a structured metadata value, like INFO=<key=value,key="value">

Parameters:

metaValue[in] The string to parse, such as a value returned by getMetaInfo().

Throws:

MalformedFile – is thrown on parsing errors.

Returns:

A map from key to value for the parsed string.
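To make the input shape concrete, here is a minimal standalone sketch of parsing a value such as <key=value,key="value"> into a map. It is not the library's implementation: `parseStructuredSketch` is a hypothetical name, it throws std::runtime_error rather than MalformedFile, and it does not handle escaped quotes inside quoted values.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical parser for values shaped like <key=value,key="value">.
std::map<std::string, std::string> parseStructuredSketch(const std::string& value) {
    if (value.size() < 2 || value.front() != '<' || value.back() != '>') {
        throw std::runtime_error("structured value must be wrapped in <...>");
    }
    std::map<std::string, std::string> result;
    const std::size_t end = value.size() - 1;  // index of the closing '>'
    std::size_t i = 1;                         // first character after '<'
    while (i < end) {
        const std::size_t eq = value.find('=', i);
        if (eq == std::string::npos || eq >= end) {
            throw std::runtime_error("missing '=' after key");
        }
        const std::string key = value.substr(i, eq - i);
        i = eq + 1;
        std::string parsed;
        if (i < end && value[i] == '"') {
            // Quoted value: runs to the next quote and may contain commas.
            const std::size_t close = value.find('"', i + 1);
            if (close == std::string::npos || close >= end) {
                throw std::runtime_error("unterminated quoted value");
            }
            parsed = value.substr(i + 1, close - i - 1);
            i = close + 1;
        } else {
            // Bare value: runs to the next comma or to the closing '>'.
            std::size_t comma = value.find(',', i);
            if (comma == std::string::npos || comma > end) {
                comma = end;
            }
            parsed = value.substr(i, comma - i);
            i = comma;
        }
        result[key] = parsed;
        if (i < end) {
            if (value[i] != ',') {
                throw std::runtime_error("expected ',' between entries");
            }
            ++i;  // step over the separator
        }
    }
    return result;
}
```

Handling the quoted case separately matters because description fields in VCF headers commonly contain commas.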

Structs

struct VCFVariantInfo

Pre-parsed information from a Variant, minus any individual data.

Public Members

std::string chromosome
size_t position
std::string identifier
std::string referenceAllele
std::vector<std::string> alternativeAlleles
double quality
std::string filter
std::unordered_map<std::string, std::string> information
std::vector<std::string> format