IGD API Documentation

Classes for IGD Handling

class IGDData

Indexable individual genotype data.

This class is for reading data from an IGD file. See IGDWriter for generating IGD output.

Public Functions

inline explicit IGDData(const std::string &filename)

Loads the header of an IGD file and prepares for accessing the variant and genotype data. This class provides per-variant access; it does not load the entire dataset into memory.

Parameters:

filename[in] The file path for an IGD file.

inline uint64_t numVariants() const

The number of variants in this file. Variants are binary in IGD, so this is not necessarily the same as the number of polymorphic sites (i.e., rows in a VCF file). For unphased data, each stored copy count is also considered a separate variant; e.g., in a diploid dataset Aa and AA are separate variants.

Returns:

Number of variants (rows of genotype data).
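To illustrate why unphased data can produce more than one variant per alternate allele, here is a minimal sketch that groups individuals by how many copies of a single alternate allele they carry. The helper splitByCopies and its input layout are hypothetical and are not part of picovcf; uint32_t is an assumed index type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper (not a picovcf function).
// altCopies[i] is the number of copies of one alternate allele that
// individual i carries (0 means the individual does not carry it).
// Returns one row per distinct non-zero copy count, mapping the copy count
// to the indexes of the individuals that have exactly that many copies.
inline std::map<uint8_t, std::vector<uint32_t>>
splitByCopies(const std::vector<uint8_t>& altCopies) {
    assert(altCopies.size() <= UINT32_MAX);
    std::map<uint8_t, std::vector<uint32_t>> rows;
    for (size_t i = 0; i < altCopies.size(); i++) {
        if (altCopies[i] > 0) {
            rows[altCopies[i]].push_back(static_cast<uint32_t>(i));
        }
    }
    return rows;
}
```

For a diploid dataset with copy counts {0, 1, 2, 1}, this yields two rows: one for a single copy (Aa) and one for two copies (AA), which matches the statement that they count as separate variants.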

inline SampleT numIndividuals() const

The number of individuals represented in the genotype data.

Returns:

Number of individuals.

inline SampleT numSamples() const

The number of samples. IGD files have a fixed ploidy, so this is either getPloidy()*numIndividuals() (for phased data) or the same as numIndividuals() (for unphased data).

Returns:

Number of samples. Every sample index will be >= 0 and < numSamples().
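The relationship between ploidy, individuals, and samples can be sketched as a small standalone function. The name expectedNumSamples is hypothetical; it only restates the rule described for numSamples() and does not call into the library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a picovcf function): the sample count implied by
// the documented rule for numSamples().
inline uint64_t expectedNumSamples(uint64_t ploidy,
                                   uint64_t numIndividuals,
                                   bool isPhased) {
    assert(ploidy >= 1);
    // Phased data has one sample per chromosome copy; unphased data has one
    // sample per individual.
    return isPhased ? ploidy * numIndividuals : numIndividuals;
}
```

For example, 100 diploid individuals give 200 samples when phased and 100 samples when unphased.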

inline std::string getSource() const

A string describing where this file came from.

Returns:

A string description.

inline std::string getDescription() const

Free-form description of the file contents.

Returns:

A string description.

inline uint64_t getPloidy() const

The ploidy of all the data contained in the file.

Returns:

The ploidy value, which is always >= 1.

inline bool isPhased() const

The phasedness of all the data in the IGD file.

Returns:

true if the data is phased, false otherwise.

inline RangePair getGenomeRange()

Get the range of positions for variants in the file.

Returns:

A pair of the minimum and maximum variant positions present in the file.

inline uint64_t getPosition(VariantT variantIndex, bool &isMissing, uint8_t &numCopies)

Get the position of the given variant.

Parameters:
  • variantIndex[in] The 0-based index of the variant, i.e. the row number.

  • isMissing[out] Will be set to true if this variant represents missing data.

  • numCopies[out] Will be set to the number of copies of the alternate allele that this variant represents.

Returns:

The position of the variant on the genome.

inline uint64_t getPosition(VariantT variantIndex, bool &isMissing)

Get the position of the given variant.

Parameters:
  • variantIndex[in] The 0-based index of the variant, i.e. the row number.

  • isMissing[out] Will be set to true if this variant represents missing data.

Returns:

The position of the variant on the genome.

inline uint64_t getPosition(VariantT variantIndex)

Get the position of the given variant.

Parameters:

variantIndex[in] The 0-based index of the variant, i.e. the row number.

Returns:

The position of the variant on the genome.

inline std::string getAltAllele(VariantT variantIndex)

Get the single alternative allele for the given variant.

Parameters:

variantIndex[in] The 0-based index of the variant, i.e. the row number.

Returns:

The string representing the alternative allele of the variant.

inline std::string getRefAllele(VariantT variantIndex)

Get the reference allele for the given variant.

Parameters:

variantIndex[in] The 0-based index of the variant, i.e. the row number.

Returns:

The string representing the reference allele of the variant.

inline IGDSampleList getSamplesWithAlt(VariantT variantIndex)

Get the list of samples that have the alternate allele for the given variant.

Parameters:

variantIndex[in] The 0-based index of the variant, i.e. the row number.

Returns:

A list (std::vector) of the sample indexes. Order is based on individual, and then ploidy within the individual. E.g., the 0th diploid individual will have sample indexes 0 and 1, the 1st will have 2 and 3, etc.
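The ordering described above implies a simple mapping from a phased sample index back to its individual and chromosome copy. A minimal sketch, assuming phased data; sampleToIndividual is a hypothetical helper and not part of picovcf.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper (not a picovcf function): split a phased sample index
// into (individual index, chromosome copy within that individual).
inline std::pair<uint64_t, uint64_t> sampleToIndividual(uint64_t sampleIndex,
                                                        uint64_t ploidy) {
    assert(ploidy >= 1);
    return {sampleIndex / ploidy, sampleIndex % ploidy};
}
```

With ploidy 2, sample index 3 maps to individual 1, copy 1, consistent with the 1st individual owning sample indexes 2 and 3.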

inline std::vector<std::string> getIndividualIds()

Read the (optional) list of individual identifiers from the file.

Returns:

A newly created std::vector<std::string> object. Individual 0 has its label at position 0, and the last individual has its label at position (numIndividuals()-1).

inline bool hasVariantIds() const

Returns:

true if this file has identifiers for variants.

inline std::vector<std::string> getVariantIds()

Read the (optional) list of variant identifiers from the file.

Returns:

A newly created std::vector<std::string> object. Variant 0 has its label at position 0, and the last variant has its identifier at position (numVariants()-1).

inline size_t lowerBoundPosition(const size_t position)

Return the index of the first variant whose position is greater than or equal to the given position. Returns numVariants() if the given position is greater than all positions in the IGD.

Parameters:

position[in] The base-pair position to search for.

Returns:

The first variant index with position greater-than-or-equal-to the given position.
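The documented return value follows the same contract as std::lower_bound. The sketch below reproduces that contract on a plain sorted vector of positions; it illustrates the semantics only and makes no claim about how the library implements the search. The name lowerBoundIndex is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a picovcf function): index of the first position
// that is >= the query, or sortedPositions.size() if there is none.
inline size_t lowerBoundIndex(const std::vector<uint64_t>& sortedPositions,
                              uint64_t position) {
    assert(std::is_sorted(sortedPositions.begin(), sortedPositions.end()));
    auto it = std::lower_bound(sortedPositions.begin(), sortedPositions.end(),
                               position);
    return static_cast<size_t>(it - sortedPositions.begin());
}
```

When several variants share a position, the index of the first of them is returned, so iterating forward from that index visits every variant at or after the query position.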

Public Static Attributes

static constexpr uint64_t IGD_MAGIC = 0x3a0c6fd7945a3481

Magic number that uniquely identifies a file as an IGD.

static constexpr uint64_t IGD_PHASED = 0x1

Flag that indicates the genotype data is phased.

static constexpr uint64_t CURRENT_IGD_VERSION = 4

The IGD file format version that this library writes.

static constexpr uint32_t DEFAULT_SPARSE_THRESHOLD = 32

If fewer than NumSamples/DEFAULT_SPARSE_THRESHOLD samples have a particular variant, the variant is stored sparsely.
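The threshold rule can be written as a one-line predicate. This is a sketch under assumptions: the name wouldStoreSparsely is hypothetical, and the use of integer division is a guess about rounding that the text does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the documented value of DEFAULT_SPARSE_THRESHOLD.
constexpr uint32_t kSparseThreshold = 32;

// Hypothetical helper (not a picovcf function): true when the number of
// samples carrying the variant is below numSamples / threshold.
// Assumption: integer division; the library's exact rounding may differ.
inline bool wouldStoreSparsely(uint64_t samplesWithVariant,
                               uint64_t numSamples,
                               uint32_t threshold = kSparseThreshold) {
    assert(threshold > 0);
    return samplesWithVariant < numSamples / threshold;
}
```

With 6400 samples and the default threshold, the cutoff is 200: a variant carried by 199 samples would be sparse, and one carried by 200 would not.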

struct FixedHeader

struct IndexEntry

class IGDWriter

Class for constructing IGD files on disk.

Public Functions

inline IGDWriter(uint32_t ploidy, uint32_t numIndividuals, bool isPhased)

Create an IGDWriter object. This does not create the file.

Parameters:
  • ploidy[in] The fixed ploidy for all individuals.

  • numIndividuals[in] The number of individuals with genotype data.

  • isPhased[in] True if the data is phased, false otherwise.

inline void writeHeader(std::ostream &outStream, const std::string &source, const std::string &description)

Write the fixed-size IGD header to the output stream.

Parameters:
  • outStream[in] The output stream.

  • source[in] A description of where this data came from.

  • description[in] Generic description of the data.

inline void writeVariantSamples(std::ostream &outStream, const uint64_t genomePosition, const std::string &referenceAllele, const std::string &altAllele, const IGDSampleList &sampleList, const bool isMissing = false, const uint8_t numCopies = 0)

Write a single variant (row) of sample/genotype data.

This can be used for phased or unphased data. When the data is phased, the sample list always consists of haploid sample indexes, with 0 <= index < (ploidy*numIndividuals). When the data is unphased, the sample list consists of individual indexes, with 0 <= index < numIndividuals.

Parameters:
  • outStream[in] The output stream.

  • genomePosition[in] The position of the variant on the genome.

  • referenceAllele[in] The string representing the reference allele.

  • altAllele[in] The string representing the alternate allele.

  • sampleList[in] A list of sample indexes. For phased diploid data, e.g., index 0 is the 0th chromosome copy of the 0th individual, index 1 is the 1st chromosome copy of the 0th individual, index 2 is the 0th chromosome copy of the 1st individual, etc. For unphased data, index 0 is the 0th individual, index 1 is the 1st individual, etc.

  • isMissing[in] [Optional] When set to true, this row represents all the samples that have missing data at the given polymorphic site.

  • numCopies[in] [Optional] Required only for unphased data. The number of copies of the alternate allele this variant represents, 0 < copies <= ploidy.
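The sample indexing for phased data can be sketched as follows. The helper phasedSampleList and its input layout are hypothetical, and uint32_t is an assumed element type; the actual IGDSampleList element type is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a picovcf function).
// haplotypes[i][k] is true when chromosome copy k of individual i carries
// the alternate allele. Returns the phased sample indexes in ascending
// order, using index = individual * ploidy + copy.
inline std::vector<uint32_t>
phasedSampleList(const std::vector<std::vector<bool>>& haplotypes,
                 uint32_t ploidy) {
    std::vector<uint32_t> samples;
    for (size_t i = 0; i < haplotypes.size(); i++) {
        assert(haplotypes[i].size() == ploidy);
        for (uint32_t k = 0; k < ploidy; k++) {
            if (haplotypes[i][k]) {
                samples.push_back(static_cast<uint32_t>(i) * ploidy + k);
            }
        }
    }
    return samples;
}
```

For three diploid individuals with haplotypes (0,1), (1,1), and (0,0), the resulting list is {1, 2, 3}.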

inline void writeIndex(std::ostream &outStream)

Write the variant index. This information is collected and saved by writeVariantSamples(), so this function must be called after that one.

Parameters:

outStream[in] The output stream.

inline void writeVariantInfo(std::ostream &outStream)

Write the table of information about the variants. This information is collected and saved by writeVariantSamples(), so this function must be called after that one.

Parameters:

outStream[in] The output stream.

inline void writeIndividualIds(std::ostream &outStream, const std::vector<std::string> &labels)

Write the table of individual identifiers.

This is an optional part of the IGD file. You can create an IGD and not call this method, in which case the individual ID table will just be empty.

Parameters:
  • outStream[in] The output stream.

  • labels[in] A list (vector) of string identifiers. One id per individual.

inline void writeVariantIds(std::ostream &outStream, const std::vector<std::string> &labels)

Write the table of variant identifiers.

This is an optional part of the IGD file. You can create an IGD and not call this method, in which case the variant ID table will just be empty.

Parameters:
  • outStream[in] The output stream.

  • labels[in] A list (vector) of string identifiers. One id per variant.

VCF to IGD conversion

inline void picovcf::vcfToIGD(const std::string &vcfFilename, const std::string &outFilename, std::string description = "", bool verbose = false, bool emitIndividualIds = false, bool emitVariantIds = false, bool forceUnphased = false, const size_t forceToPloidy = 0, bool dropUnphased = false, void (*variantCallback)(const VCFFile&, VCFVariantView&, void*) = nullptr, void *callbackContext = nullptr, std::string contig = PVCF_VCFFILE_CONTIG_ALL, const std::pair<size_t, size_t> region = {INVALID_POSITION, INVALID_POSITION})

Using minimal memory, convert the given VCF file (can be gzipped) to an IGD file with the given name.

Parameters:
  • vcfFilename[in] The name of the input VCF file to be converted.

  • outFilename[in] The name of the output IGD file to be created.

  • description[in] [Optional] A description of the dataset.

  • verbose[in] [Optional] Set to true to get statistics printed to stdout.

  • emitIndividualIds[in] [Optional] Copy individual IDs to IGD file (false by default).

  • emitVariantIds[in] [Optional] Copy variant IDs to IGD file (false by default).

  • forceUnphased[in] [Optional] When true, force the result to be unphased even if the input is phased (or mixed phased-ness).

  • variantCallback[in] [Optional] When non-null, invoke this callback on every variant (row) that is emitted to the IGD file. The arguments to the callback are (const VCFFile&, VCFVariantView& variant, void* context), where the variant view and VCF file can be used to get metadata, and context is a user-provided pointer (see callbackContext).

  • callbackContext[in] [Optional] Pointer to an object that will be passed to variantCallback.

  • contig[in] [Optional] Select which contig(s) to use when converting from VCF to IGD. IGD does not support contigs, so everything will be “merged” into a single contig in the IGD file. Use PVCF_VCFFILE_CONTIG_REQUIRE_ONE if you only want to convert VCF files that have a single contig. See also PVCF_VCFFILE_CONTIG_ALL (default) and PVCF_VCFFILE_CONTIG_FIRST. Any other value is treated as a free-form string that should match the name of the contig to be converted.

  • region[in] [Optional] Only convert variants found within this range, which is a pair (start, end) where each position is inclusive.
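Because both ends of the region are inclusive, a position equal to either bound is kept. A minimal sketch of that check; inRegion is a hypothetical helper, and it deliberately ignores the INVALID_POSITION default, whose value and handling are not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not a picovcf function): inclusive range test for a
// (start, end) region. Does not model the INVALID_POSITION default.
inline bool inRegion(size_t position,
                     const std::pair<size_t, size_t>& region) {
    assert(region.first <= region.second);
    return position >= region.first && position <= region.second;
}
```

For the region (100, 200), positions 100 and 200 are both inside, while 99 and 201 are outside.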

IGD merging

inline void picovcf::mergeIGDs(std::ostream &outputStream, const std::vector<std::string> &inputFilenames, std::string description = {})

Merge the IGD files given by a list of filenames into a single output stream. The input files must be mutually exclusive by genome range, such that if one input covers variants over the range (R1, R2) and another covers (R3, R4) then either R1 >= R4 or R3 >= R2. The samples described by the inputs must also be identical.

If only some input IGDs have variant IDs, the remaining variants will get the empty string for their identifiers. The individual IDs will be taken from the first input IGD (in ascending order of genetic position) and will not be checked against the individual IDs of the other IGD files.

Parameters:
  • outputStream[in] The output stream to write the merged IGD to.

  • inputFilenames[in] List of filenames of the input IGDs to be merged.

  • description[in] [Optional] Description to write to the IGD header. If not specified (empty) then the description of the first input IGD will be used.
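The requirement that inputs be mutually exclusive by genome range can be checked before merging. The sketch below is a standalone illustration: rangesAreMergeable and the Range alias are hypothetical, and the (min, max) pairs are assumed to come from something like getGenomeRange() on each input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical alias: (minimum position, maximum position) of one input.
using Range = std::pair<uint64_t, uint64_t>;

// Hypothetical helper (not a picovcf function): true when every pair of
// ranges (R1, R2), (R3, R4) satisfies R1 >= R4 or R3 >= R2.
inline bool rangesAreMergeable(std::vector<Range> ranges) {
    for (const Range& r : ranges) {
        assert(r.first <= r.second);
    }
    // After sorting by (start, end), it is enough to compare neighbours:
    // each range must start at or after the end of the previous one.
    std::sort(ranges.begin(), ranges.end());
    for (size_t i = 1; i < ranges.size(); i++) {
        if (ranges[i].first < ranges[i - 1].second) {
            return false;
        }
    }
    return true;
}
```

Note that the stated condition uses >=, so two inputs may share a boundary position: (1, 100) and (100, 200) pass the check, while (1, 100) and (50, 200) do not.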