0

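I'm writing a program in C that reads text from a FASTA file. For each name (for example >COTV-SPAn232-096), I want the program to recognize the ">" and then use the text that follows it, up to the \n, as the name of a variable.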

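The problem with hard-coding the variables is that the program needs to be as dynamic as possible, since it may read any number of different data sets. For example, my test set has 15 different sequences that look like this: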

>COTV-SPAn232-096
MKILNSYNDFIISFINFILFPTIQNVSISKLNILGYILSFIRIISISMDFDILKFSNIIQDYGLIFPDDIKKIQNEKFLVLERGLSGKLYAIHIYDFMARFDNETIFGIAKFLYRNNTKILDVLFINKDLFDKTDILYPKSTITLSSYSDEYIDYTYKTIKLIFLNLFNSFRFSKIDSKLSYLYLPLRKDINNVIL

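The plan is to read in the name of the sequence, set that name as the variable for a dynamic array, and use malloc/realloc to handle storing the actual sequence for a later comparison of all the different sequences. I can handle everything except the variable variable names.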

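From briefly looking around for an answer, it seems this can't be done in C but can be done in Python and some other languages. I really hope that isn't the case, but if it is, does anyone have other suggestions for handling this? And yes, this is bioinformatics, and I should probably be using Python, Perl, Java, or some other language, but I'd rather keep working on this in C to become more proficient in it.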

Thanks in advance for any answers I may receive!

4

3 Answers

6

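This is not possible in C, but there is never a reason to create variables with dynamic names. (Indeed, even if you could create such variables in C, how would you use them?)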

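Instead, use a hash table: a data structure that maps keys to values. In your case, you want it to map from a string (your sequence name) to a string (your sequence).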

Examples of hash table libraries for C abound online; this StackOverflow question lists a few.
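To make the idea concrete, here is a minimal sketch of a chained hash table that maps a sequence name to its sequence. Every name in it (`seq_table_t`, `entry_t`, `seq_table_put`, `seq_table_get`, `NUM_BUCKETS`) is hypothetical and exists only for illustration. The hash is the well-known djb2 string hash, picked for brevity; a real program would more likely use an existing library.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 64  /* arbitrary bucket count, chosen for illustration */

/* One key/value pair; entries that hash to the same bucket form a linked list. */
typedef struct entry {
    char *name;         /* sequence name, e.g. "COTV-SPAn232-096" */
    char *sequence;     /* residues stored under that name */
    struct entry *next;
} entry_t;

typedef struct {
    entry_t *buckets[NUM_BUCKETS];
} seq_table_t;

/* Copy a string into freshly allocated memory; returns NULL on failure. */
static char *copy_string(const char *s)
{
    size_t size = strlen(s) + 1;
    char *copy = malloc(size);
    if (copy)
        memcpy(copy, s, size);
    return copy;
}

/* djb2 string hash, reduced to a bucket index. */
static size_t bucket_for(const char *key)
{
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return (size_t)(h % NUM_BUCKETS);
}

/* Look up a sequence by name; returns NULL if the name is not stored. */
const char *seq_table_get(const seq_table_t *t, const char *name)
{
    const entry_t *e;
    for (e = t->buckets[bucket_for(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->sequence;
    return NULL;
}

/* Insert or overwrite; returns 0 on success, -1 if out of memory. */
int seq_table_put(seq_table_t *t, const char *name, const char *sequence)
{
    size_t b = bucket_for(name);
    entry_t *e;
    char *seq_copy = copy_string(sequence);
    if (!seq_copy)
        return -1;

    for (e = t->buckets[b]; e != NULL; e = e->next) {
        if (strcmp(e->name, name) == 0) {   /* name already present: replace value */
            free(e->sequence);
            e->sequence = seq_copy;
            return 0;
        }
    }

    e = malloc(sizeof *e);
    if (!e) {
        free(seq_copy);
        return -1;
    }
    e->name = copy_string(name);
    if (!e->name) {
        free(e);
        free(seq_copy);
        return -1;
    }
    e->sequence = seq_copy;
    e->next = t->buckets[b];    /* push onto the front of the bucket's list */
    t->buckets[b] = e;
    return 0;
}
```

A table declared as `seq_table_t table = {{0}};` starts empty. After `seq_table_put(&table, "COTV-SPAn232-096", "MKILNSY")`, a call to `seq_table_get(&table, "COTV-SPAn232-096")` returns the stored sequence. Freeing the table is left out to keep the sketch short.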

answered 2013-01-22T01:24:34.143
2

The plan is to read in the name of the sequence, set that name as the variable for a dynamic array, and use malloc/realloc to handle storing the actual sequence for a later comparison of all the different sequences. I can handle everything except the variable variable names.

Instead of naming the variable with the sequence header/name, create a struct that holds the sequence header/name and the sequence, e.g.:

typedef struct {
    char *header;
    char *sequence;
} fasta_t;

Then create a list of fasta_t pointers ("pointer to pointers"):

fasta_t **fasta_elements = NULL;

Use malloc() to allocate space for N elements of type fasta_t *, e.g.:

fasta_elements = malloc(N * sizeof(fasta_t *));

It's a good idea to check if you actually got the memory you asked for:

if (!fasta_elements) {
    /* i.e., if fasta_elements is still NULL */
    fprintf(stderr, "ERROR: Could not allocate space for FASTA element list!\n");
    return EXIT_FAILURE;
}

(You should get into the habit of doing this with every pointer you malloc(), in my opinion.)

Now that space has been allocated for the list, read in N elements (use realloc() if the list needs to grow, but let's assume N elements for now). Within a loop, allocate a fasta_t struct for each element, then allocate space for the header and sequence strings (the char * members) inside that struct:

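#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* malloc, free, EXIT_FAILURE */
#include <string.h>  /* strlen and string copying */

#define MAX_HEADER_LENGTH 256
#define MAX_SEQUENCE_LENGTH 4096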

/* ... */

size_t idx;
char current_header[MAX_HEADER_LENGTH] = {0};
char current_sequence[MAX_SEQUENCE_LENGTH] = {0};

for (idx = 0U; idx < N; idx++) 
{
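    /* allocate the fasta_t struct itself (it holds the header and sequence pointers) */
    fasta_elements[idx] = malloc(sizeof(fasta_t));
    if (!fasta_elements[idx]) {
        fprintf(stderr, "ERROR: Could not allocate space for FASTA element!\n");
        return EXIT_FAILURE;
    }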

    /* parse current_header and current_sequence out of FASTA input */
    /* ... */

    /* validate input -- does current_header start with a '>' character, for instance? */
    /* data in bioinformatics is messy -- validate input where you can */

    /* set up space for the header and sequence strings (+1 for the terminating '\0') */
    /* sizeof(char) is redundant in C, because sizeof(char) is always 1, but I'm putting it here for completeness */
    fasta_elements[idx]->header = malloc((strlen(current_header) + 1) * sizeof(char));
    fasta_elements[idx]->sequence = malloc((strlen(current_sequence) + 1) * sizeof(char));
    if (!fasta_elements[idx]->header || !fasta_elements[idx]->sequence) {
        fprintf(stderr, "ERROR: Could not allocate space for FASTA header or sequence!\n");
        return EXIT_FAILURE;
    }

    /* copy each string to the list pointer, for which we just allocated space */
    strncpy(fasta_elements[idx]->header, current_header, strlen(current_header) + 1);
    strncpy(fasta_elements[idx]->sequence, current_sequence, strlen(current_sequence) + 1);
}
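The loop above assumes N is known in advance. As a rough sketch of the realloc() case, the helper below doubles the list's capacity whenever it is full. The names `fasta_list_t` and `fasta_list_append` are invented here and are not part of the code above; the `fasta_t` typedef is repeated only so the block compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *header;
    char *sequence;
} fasta_t;

/* Hypothetical growable list: count elements in use, capacity slots allocated. */
typedef struct {
    fasta_t **elements;
    size_t count;
    size_t capacity;
} fasta_list_t;

/* Append a copy of header/sequence; returns 0 on success, -1 on allocation failure. */
int fasta_list_append(fasta_list_t *list, const char *header, const char *sequence)
{
    fasta_t *elem;
    size_t header_size = strlen(header) + 1;
    size_t sequence_size = strlen(sequence) + 1;

    if (list->count == list->capacity) {
        /* grow geometrically so repeated appends stay cheap on average */
        size_t new_capacity = list->capacity ? list->capacity * 2 : 16;
        /* use a temporary: if realloc() fails it returns NULL and the old block stays valid */
        fasta_t **grown = realloc(list->elements, new_capacity * sizeof *grown);
        if (!grown)
            return -1;
        list->elements = grown;
        list->capacity = new_capacity;
    }

    elem = malloc(sizeof *elem);
    if (!elem)
        return -1;
    elem->header = malloc(header_size);
    elem->sequence = malloc(sequence_size);
    if (!elem->header || !elem->sequence) {
        free(elem->header);     /* free(NULL) is a no-op */
        free(elem->sequence);
        free(elem);
        return -1;
    }
    memcpy(elem->header, header, header_size);
    memcpy(elem->sequence, sequence, sequence_size);

    list->elements[list->count++] = elem;
    return 0;
}
```

Starting from `fasta_list_t list = {NULL, 0, 0};`, each parsed record can be passed to `fasta_list_append(&list, current_header, current_sequence)` without knowing the number of records beforehand.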

To print the header of the (i+1)-th element, for example:

fprintf(stdout, "%s\n", fasta_elements[i]->header);

(Remember that indexing is 0-based in C — the 10th element has index 9, for instance.)

When finished, be sure to free() the individual pointers inside each fasta_t struct, then the fasta_t * pointer itself, and finally the fasta_t ** list of pointers:

for (idx = 0U; idx < N; idx++) 
{
    /* free the members first, then the struct that holds them */
    free(fasta_elements[idx]->header), fasta_elements[idx]->header = NULL;
    free(fasta_elements[idx]->sequence), fasta_elements[idx]->sequence = NULL;
    free(fasta_elements[idx]), fasta_elements[idx] = NULL;
}
free(fasta_elements), fasta_elements = NULL;

For convenience, once you get the hang of dealing with structs and memory management, you'll want to write wrapper functions that set up, access, edit and break down a fasta_t * element, as well as wrapper functions that do the same for a list of fasta_t * elements.
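As a hedged illustration of what such wrappers might look like, here is a sketch of a constructor and a destructor for a single element. The names `fasta_create` and `fasta_destroy` are hypothetical, and the `fasta_t` typedef is repeated from above so the block stands alone.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *header;
    char *sequence;
} fasta_t;

/* Release an element and everything it owns; safe to call with NULL. */
void fasta_destroy(fasta_t *elem)
{
    if (!elem)
        return;
    free(elem->header);
    free(elem->sequence);
    free(elem);
}

/* Allocate an element holding private copies of both strings.
   Returns NULL if any allocation fails, with nothing leaked. */
fasta_t *fasta_create(const char *header, const char *sequence)
{
    size_t header_size = strlen(header) + 1;
    size_t sequence_size = strlen(sequence) + 1;
    fasta_t *elem = malloc(sizeof *elem);

    if (!elem)
        return NULL;
    elem->header = malloc(header_size);
    elem->sequence = malloc(sequence_size);
    if (!elem->header || !elem->sequence) {
        fasta_destroy(elem);    /* frees whichever member was allocated */
        return NULL;
    }
    memcpy(elem->header, header, header_size);
    memcpy(elem->sequence, sequence, sequence_size);
    return elem;
}
```

With these two functions, the body of the read loop reduces to one `fasta_create()` call per record and the cleanup loop to one `fasta_destroy()` call per element.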

answered 2013-01-22T01:29:48.977
2

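The bad news is that it can't be done in C, because a variable in C is a compile-time concept. Variables act as "labels" for regions of memory that contain data; once the compiler is done, the names of most variables are discarded. They may be written to a separate file for the debugger, but that is only a convenience for humans.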

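The good news is that you don't need to give your new variables new names. All you need is a second variable that holds the name. A pair of variables, one for the name and one for the value, is all you need.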
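A rough sketch of that idea, using hypothetical names (`named_seq_t`, `find_by_name`): the name is ordinary string data stored next to the value, and finding an entry is a string comparison at run time rather than a variable reference.

```c
#include <stddef.h>
#include <string.h>

/* The "variable name" is just data stored alongside the value. */
typedef struct {
    const char *name;
    const char *value;
} named_seq_t;

/* Linear search by name; returns NULL when no entry matches. */
const named_seq_t *find_by_name(const named_seq_t *items, size_t count, const char *name)
{
    size_t i;
    for (i = 0; i < count; i++)
        if (strcmp(items[i].name, name) == 0)
            return &items[i];
    return NULL;
}
```

A linear search like this is fine for a handful of sequences; for large inputs, a keyed structure such as a hash table avoids scanning every entry.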

answered 2013-01-22T01:26:22.233