c - Extracting Wiki links using C

Question

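I need to write a program that reads a Wikipedia source file and extracts all links that point to other wiki pages. Every such link looks like this example: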

<a href="/wiki/PageName" title="PageName">Chicken</a>

Basically, I need to compare the PageName after /wiki/ with the title. If they are identical, as above, only PageName should be printed to the terminal.

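However, the following should not be matched, because their format differs from the above:

<a href="http://chicken.com">Chicken</a> (a link to an ordinary website outside Wikipedia)

<a href="/wiki/Chicken">Chicken</a> (the title= part is missing)

The output I am trying to achieve looks like this:

(image: sample output I am trying to achieve)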

I have been working on this for a long time and have managed to get this far:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
  FILE * file;
  file = fopen(argv[1], "r");

  char line[512];
  char* search;

  while(!feof(file)){
    fgets(line,512,file);

    search = strstr( line, "<a href=\"/wiki/");

    if(search != NULL){
        puts(search);
    }
  }
}

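The code only filters up to /wiki/, but from here on I am drawing a blank. I have searched a lot but could not find a lead. Any help would be appreciated.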

score 2 · Accepted Answer

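Instead of while(!feof(file)) you can use while(fgets(line,512,file)). feof only reports end-of-file after a read has already failed, so the original loop processes the last line a second time. With a few validation checks added, your final code with the expected output would look like this: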

#ifdef  _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#else
#define _strdup strdup /* _strdup is MSVC-specific; assumes POSIX strdup elsewhere */
#endif //  MSC

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    FILE * file;

    if (argc != 2)
    {
        return -1;
    }

    file = fopen(argv[1], "r");

    if (!file)
    {
        return -1;
    }
    char line[512];
    char* search;

    while (fgets(line, 512, file)) {
        search = strstr(line, "<a href=\"/wiki/");

        if (search != NULL) {
            char *title = _strdup(search);
            if (title)
            {
                char* start = strstr(title, ">");
                /* Guard: a line cut off at 511 chars may contain no '>' */
                char* end = start ? strstr(start, "<") : NULL;
                if (end)
                {
                    *end = 0;
                }
                if (start && strlen(start) >= 2)
                {
                    puts(start + 1);
                }
                free(title);
                title = 0;
            }
        }
    }
    fclose(file);
    file = NULL;
    return 0;
}
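
The code above prints the text between > and < for each /wiki/ link. The question also asks to compare the PageName in href with the title attribute and print PageName only when they match. A minimal sketch of that comparison follows. The helper name next_wiki_link is hypothetical and not part of the original answer, and it assumes the title attribute follows href directly, exactly as in the question's example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not from the answer above).
 * Scans `p` for the next <a href="/wiki/NAME" title="NAME"> whose two
 * NAME parts are identical, copies NAME into `out`, and returns a pointer
 * just past the title attribute. Returns NULL when no such link is left.
 */
static const char *next_wiki_link(const char *p, char *out, size_t out_size)
{
    static const char href_prefix[] = "<a href=\"/wiki/";
    static const char title_prefix[] = "\" title=\"";
    const size_t href_len = sizeof href_prefix - 1;
    const size_t title_len = sizeof title_prefix - 1;

    assert(p != NULL && out != NULL && out_size > 0);

    while ((p = strstr(p, href_prefix)) != NULL) {
        const char *name = p + href_len;
        const char *name_end = strchr(name, '"');
        if (name_end == NULL)
            return NULL;                /* href value is never closed */
        size_t len = (size_t)(name_end - name);
        p = name;                       /* resume here if this anchor fails */

        if (len == 0 || len >= out_size)
            continue;                   /* empty name or does not fit in out */
        if (strncmp(name_end, title_prefix, title_len) != 0)
            continue;                   /* no title= right after href */
        const char *title = name_end + title_len;
        if (strncmp(title, name, len) != 0 || title[len] != '"')
            continue;                   /* title differs from PageName */

        memcpy(out, name, len);
        out[len] = '\0';
        return title + len + 1;
    }
    return NULL;
}
```

Inside the fgets loop this could be used as: set const char *p = line; then call p = next_wiki_link(p, name, sizeof name) repeatedly and puts(name) until it returns NULL. That way several links on one line are all reported. A link that is split across two fgets reads would still be missed.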

score 1 · Accepted Answer

size_t sz;
fseek(file, 0L , SEEK_END);
sz=ftell(file);
rewind(file);
char line[sz+1];

This may fix the segmentation fault.
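
The snippet above sizes a stack array to the file length, which can itself overflow the stack for a large page. As a hedged alternative, the sketch below applies the same fseek/ftell idea but allocates on the heap and checks each call. The helper name read_whole_file is hypothetical and not part of the original answer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (an assumption, not from the answer above).
 * Reads the whole stream into a malloc'd, NUL-terminated buffer.
 * Stores the number of bytes read in *out_len when out_len is not NULL.
 * Returns NULL on failure; the caller must free() the result.
 */
static char *read_whole_file(FILE *file, size_t *out_len)
{
    assert(file != NULL);

    if (fseek(file, 0L, SEEK_END) != 0)
        return NULL;
    long sz = ftell(file);              /* ftell returns long, -1L on error */
    if (sz < 0)
        return NULL;
    rewind(file);

    char *buf = malloc((size_t)sz + 1); /* +1 for the terminating NUL */
    if (buf == NULL)
        return NULL;

    /* fread may return fewer bytes than sz (e.g. text mode on Windows),
       so terminate at the count actually read. */
    size_t n = fread(buf, 1, (size_t)sz, file);
    buf[n] = '\0';
    if (out_len != NULL)
        *out_len = n;
    return buf;
}
```

With the whole file in one buffer, strstr can scan across line boundaries, so links that would be cut by a 512-byte fgets buffer are no longer split.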
