I am trying to integrate samtools into a C program. This application reads data in a binary format called BAM, e.g. from stdin:

$ cat foo.bam | samtools view -h -
...

(I realize this is a useless use of cat with samtools.)

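In the C program, I want to write chunks of unsigned char bytes to the samtools binary, while capturing the standard output of samtools after it processes those bytes.

Because I cannot use popen() to both write to and read from a process, I looked into a publicly available implementation of popen2(), which appears to have been written to support this.

I wrote the following test code, which attempts to write() 4 kB chunks of bytes from a BAM file in the same directory to the samtools process. It then read()s bytes from the output of samtools into a line buffer, which is printed to standard error: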

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define READ 0
#define WRITE 1

pid_t popen2(const char *command, int *infp, int *outfp)
{
    int p_stdin[2], p_stdout[2];
    pid_t pid;

    if (pipe(p_stdin) != 0 || pipe(p_stdout) != 0)
        return -1;

    pid = fork();

    if (pid < 0)
        return pid;
    else if (pid == 0)
    {
        close(p_stdin[WRITE]);
        dup2(p_stdin[READ], READ);
        close(p_stdout[READ]);
        dup2(p_stdout[WRITE], WRITE);

        execl("/bin/sh", "sh", "-c", command, NULL);
        perror("execl");
        exit(1);
    }

    if (infp == NULL)
        close(p_stdin[WRITE]);
    else
        *infp = p_stdin[WRITE];

    if (outfp == NULL)
        close(p_stdout[READ]);
    else
        *outfp = p_stdout[READ];

    return pid;
}

int main(int argc, char **argv)
{
    int infp, outfp;

    /* set up samtools to read from stdin */
    if (popen2("samtools view -h -", &infp, &outfp) <= 0) {
        printf("Unable to exec samtools\n");
        exit(1);
    }

    const char *fn = "foo.bam";
    FILE *fp = NULL;
    fp = fopen(fn, "r");
    if (!fp)
        exit(-1);
    unsigned char buf[4096];
    char line_buf[65536] = {0};
    while(1) {
        size_t n_bytes = fread(buf, sizeof(buf[0]), sizeof(buf), fp);
        fprintf(stderr, "read\t-> %08zu bytes from fp\n", n_bytes);
        write(infp, buf, n_bytes);
        fprintf(stderr, "wrote\t-> %08zu bytes to samtools process\n", n_bytes);
        read(outfp, line_buf, sizeof(line_buf));
        fprintf(stderr, "output\t-> \n%s\n", line_buf);
        memset(line_buf, '\0', sizeof(line_buf));
        if (feof(fp) || ferror(fp)) {
            break;
        }
    }
    return 0;
}

(For a local copy of foo.bam, here is a link to the binary file I am using for testing. But any BAM file will do for testing purposes.)

To compile:

$ cc -Wall test_bam.c -o test_bam

The problem is that the program hangs after the write() call:

$ ./test_bam
read    -> 00004096 bytes from fp
wrote   -> 00004096 bytes to samtools process
[bam_header_read] EOF marker is absent. The input is probably truncated.

If I call close() on the infp variable immediately after the write() call, the loop goes through one more iteration before hanging:

...
write(infp, buf, n_bytes);
close(infp); /* <---------- added after the write() call */
fprintf(stderr, "wrote\t-> %08zu bytes to samtools process\n", n_bytes);
...

With the close() statement:

$ ./test_bam
read    -> 00004096 bytes from fp
wrote   -> 00004096 bytes to samtools process
[bam_header_read] EOF marker is absent. The input is probably truncated.
[main_samview] truncated file.
output  -> 
@HD VN:1.0 SO:coordinate
@SQ SN:seq1 LN:5000
@SQ SN:seq2 LN:5000
@CO Example of SAM/BAM file format.

read    -> 00004096 bytes from fp
wrote   -> 00004096 bytes to samtools process

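With this change, I get some of the output I would expect if I ran samtools on the command line, but as mentioned, the process hangs again.

How can I use popen2() to write and read data in chunks to and from an internal buffer? If that is not possible, are there alternatives to popen2() that would be better suited to this task?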

2 Answers

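As an alternative to a pipe, why not communicate with samtools via a socket? Looking at the samtools source, the file knetfile.c indicates that samtools has socket communication available: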

#include "knetfile.h"

/* In winsock.h, the type of a socket is SOCKET, which is: "typedef
* u_int SOCKET". An invalid SOCKET is: "(SOCKET)(~0)", or signed
* integer -1. In knetfile.c, I use "int" for socket type
* throughout. This should be improved to avoid confusion.
*
* In Linux/Mac, recv() and read() do almost the same thing. You can see
* in the header file that netread() is simply an alias of read(). In
* Windows, however, they are different and using recv() is mandatory.
*/

This may offer a better option than using pipe2.
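To make the socket idea concrete, here is a minimal sketch under some assumptions. It does not use knetfile.c or any samtools API; it uses the generic POSIX socketpair() call to give a child process a single bidirectional descriptor as both its stdin and stdout. The name spawn_on_socketpair is hypothetical and chosen only for illustration.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of samtools): start `command` with one end
 * of a UNIX socket pair as both its stdin and stdout. Returns the child's
 * pid and stores the parent's end in *fd, or returns -1 on failure. */
pid_t spawn_on_socketpair(const char *command, int *fd)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: the same socket serves as stdin and stdout. */
        dup2(sv[1], STDIN_FILENO);
        dup2(sv[1], STDOUT_FILENO);
        close(sv[0]);
        close(sv[1]);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);
    }

    close(sv[1]); /* parent keeps only its own end */
    *fd = sv[0];
    return pid;
}
```

With a socket, the parent can call shutdown(fd, SHUT_WR) once all input has been written, which delivers end-of-file to the child while leaving the same descriptor open for reading the remaining output. For inputs larger than the socket buffers, reads and writes would still need to be interleaved, for example with poll().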

answered 2014-06-19T22:58:37.943

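This problem has nothing to do with the specific implementation of popen2. Also note that on OS X, popen allows you to open a bidirectional pipe, and this may be true on other BSD systems as well. If this needs to be portable, you will need a configure check for whether popen allows bidirectional pipes (or something equivalent to a configure check).

You need to switch the pipes to non-blocking mode and alternate read and write calls in an infinite loop. So that it does not waste CPU while the samtools process is busy, such a loop needs to use select, poll, or a similar mechanism to block until a file descriptor becomes "available" (more data to read, or ready to accept data for writing).

See this question for some inspiration.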
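The loop described here can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the question or from any library: filter_through and set_nonblocking are hypothetical names, the whole input is assumed to be in memory already, and output beyond out_cap bytes is read and discarded. Note that it ignores SIGPIPE for the whole process so that a child that exits early produces an EPIPE error instead of killing the parent.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void set_nonblocking(int fd)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
}

/* Hypothetical helper (not part of samtools or popen2): feed in[0..in_len)
 * to `command` and store up to out_cap bytes of its stdout in `out`.
 * Returns the number of bytes stored, or -1 on error. */
ssize_t filter_through(const char *command,
                       const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    int to_child[2], from_child[2];
    if (pipe(to_child) != 0)
        return -1;
    if (pipe(from_child) != 0) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: wire the pipes to stdin/stdout and drop every other end. */
        dup2(to_child[0], STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[0]); close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);
    }
    int wfd = to_child[1], rfd = from_child[0], failed = 0;
    size_t sent = 0, got = 0;
    close(to_child[0]);
    close(from_child[1]);
    set_nonblocking(wfd);
    set_nonblocking(rfd);
    signal(SIGPIPE, SIG_IGN); /* process-wide: a dead child gives EPIPE */

    while (rfd >= 0) {
        if (wfd >= 0 && sent == in_len) { /* all input sent: signal EOF */
            close(wfd);
            wfd = -1;
        }
        /* poll() ignores entries whose fd is negative. */
        struct pollfd fds[2] = {
            { .fd = rfd, .events = POLLIN },
            { .fd = wfd, .events = POLLOUT },
        };
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;
            failed = 1;
            break;
        }
        if (fds[0].revents) { /* data, EOF, or error on the child's stdout */
            unsigned char tmp[4096];
            ssize_t n = read(rfd, tmp, sizeof tmp);
            if (n > 0) {
                size_t room = out_cap - got;
                size_t keep = (size_t)n < room ? (size_t)n : room;
                memcpy(out + got, tmp, keep); /* excess output is dropped */
                got += keep;
            } else if (n == 0) {
                close(rfd); /* child closed stdout: we are done */
                rfd = -1;
            } else if (errno != EAGAIN && errno != EINTR) {
                failed = 1;
                break;
            }
        }
        if (fds[1].revents) { /* room in the pipe, or the child went away */
            ssize_t n = write(wfd, in + sent, in_len - sent);
            if (n > 0) {
                sent += (size_t)n;
            } else if (n < 0 && errno != EAGAIN && errno != EINTR) {
                close(wfd); /* e.g. EPIPE: stop writing, keep reading */
                wfd = -1;
            }
        }
    }
    if (rfd >= 0) close(rfd);
    if (wfd >= 0) close(wfd);
    waitpid(pid, NULL, 0);
    return failed ? -1 : (ssize_t)got;
}
```

Called as filter_through("samtools view -h -", bam_bytes, bam_len, out, sizeof out), where bam_bytes and bam_len are placeholders for the file contents, the parent never blocks in write() while the child is itself blocked writing output nobody is reading. Closing the write end once all input has been sent is what lets the child see end-of-file and finish.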

answered 2014-06-19T22:39:46.690