I am using htslib for extracting all the information contained in a VCF file in C++.
Currently, thanks to the VCF specification and the documentation in the file vcf.h, I have successfully extracted all the metadata information in the header (Meta-Information Lines), and most of the information contained in each row of the body of the file (Data Lines).
However, I don't know how to extract the genotype information (sample columns).
I am using example files from the 1000G project. This is an example of two rows of the file, it shows the Format field and two samples (The file has more than 1000 samples per each row, I would like to extract the data for all of them):
FORMAT HG00096 HG00097
GT:DS:GL 0|0:0.050:-0.48,-0.48,-0.48 0|0:0.050:-0.24,-0.40,-1.49
GT:DS:GL 0|0:0.000:-0.10,-0.69,-4.70 0|0:0.000:-0.05,-0.94,-5.00
I know that this is a heavy task that would take some computation time. I have extracted the names of each column (HG00096, HG00077...), but I don't know how to extract the information of each sample either as a full string (e.g., "0|0:0.050:-0.48,-0.48,-0.48"), as a set (array, map, vector...) of key-value pairs (e.g., [("GT", "0|0"), ("DS", "0.050"), ("GL", "-0.48,-0.48,-0.48")), or simply as an array of values (e.g., ["0|0", "0.050", "-0.48,-0.48,-0.48"]. I would like to do this for each sample.
I have been reading the documentation in the vcf.h file and I think that the function bcf_get_genotypes(hdr,line,dst,ndst) may be suitable for this, but I don't know for sure how to use it for extracting the values as strings. Also, I think that this information may be stored inside the 'p' pointer of 'bcf_fmt_t', but I don't know for sure, it just contains an array set of uint8_t values and I don't know if a string (or char array) can be extracted in the way I want.
typedef struct bcf_fmt_t {
int id;
int n, size, type;
uint8_t *p;
uint32_t p_len;
uint32_t p_off:31, p_free:1;
} bcf_fmt_t;
Is there a way of doing that I am trying to do?