python - Continuously parsing a file in Python

Question

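I'm writing a script that parses a file of HTTP traffic lines and extracts the domains; for now it just prints them to the screen. I'm using httpry to continuously write the traffic to a file. Here is the script I use to pull out the domain names: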

#!/usr/bin/env python3

# Read the httpry log and print the domain field of each line
with open("results.txt", "r") as log:
    for line in log:
        domain = line.split()[6]
        if domain != "-":
            print(domain)

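This script works fine, but I'd like a way to run it continuously, so that when new traffic is appended to the input file the script picks it up. I can't just run awk on httpry's output, because I will eventually be inserting these domains into a Mongo database and need the script to do that as well. If anyone can give me some ideas on how to run this Python script continuously over the output without reprinting earlier entries, it would be much appreciated. Thanks.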

score 6 · Accepted Answer

Try this implementation of tail -f, found at http://code.activestate.com/recipes/157035-tail-f-in-python/

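import time

# Open the file that httpry keeps appending to
file = open("results.txt", "r")

while True:
    where = file.tell()
    line = file.readline()
    if not line:
        # Nothing new yet: wait, then return to the last known position
        time.sleep(1)
        file.seek(where)
    else:
        print(line, end="")  # line already has a newline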
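
To connect this polling loop to the domain extraction from the question, here is a minimal sketch rather than a definitive implementation. It assumes the domain is still the seventh whitespace-separated field, as in the question's script. The helper names `extract_domain` and `follow`, and the parameters `poll_interval`, `from_end` and `max_idle_polls`, are made up for illustration and are not part of the recipe or of httpry.

```python
import os
import time


def extract_domain(line, field_index=6):
    """Return the domain field of a log line, or None if it is missing or "-"."""
    fields = line.split()
    if len(fields) <= field_index:
        return None  # line is too short to contain the field
    domain = fields[field_index]
    return None if domain == "-" else domain


def follow(path, poll_interval=1.0, from_end=False, max_idle_polls=None):
    """Yield complete lines as they are appended to `path`, like `tail -f`.

    from_end=True skips whatever is already in the file.
    max_idle_polls (hypothetical knob) stops the generator after that many
    consecutive empty polls; None means poll forever.
    """
    idle = 0
    with open(path, "r") as handle:
        if from_end:
            handle.seek(0, os.SEEK_END)
        while True:
            where = handle.tell()
            line = handle.readline()
            if line.endswith("\n"):
                idle = 0
                yield line
            else:
                # Nothing new, or a line that is only partially written:
                # go back to the saved position and try again later.
                if max_idle_polls is not None and idle >= max_idle_polls:
                    return
                idle += 1
                time.sleep(poll_interval)
                handle.seek(where)


# Example use (runs until interrupted):
# for line in follow("results.txt"):
#     domain = extract_domain(line)
#     if domain is not None:
#         print(domain)
```

Because each line is yielded only once, earlier entries are not reprinted while the script keeps running. The `print(domain)` call is where a database insert could go later.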

score 0 · Accepted Answer

Node.js has a nice readline module that should handle this well:

var readline = require('readline')
  , fs = require('fs')

var input = process.stdin; // or: fs.createReadStream('input.txt');
var output = process.stdout; // or: fs.createWriteStream('output.txt')

var reader = readline.createInterface({
  input: input,
  output: output
});

// Write the seventh space-separated field (the domain) of each line
reader.on('line', function(line) {
  var domain = line.split(/[ ]+/)[6];
  if (domain && domain !== '-') {
    output.write(domain + '\n');
  }
});

Save this in a .js file and run node domains.js, or whatever you named it. Or use cat file | node domains.js.

It should also integrate nicely with mongodb in the future :)
