python - Matching "byte spans" to a text document, Python

Question

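I am working with an annotated corpus that consists of two sets of .txt files. The first set contains the documents that were annotated (articles, blog posts, and so on), and the second set contains the annotations themselves. Annotations are matched to the annotated text by "byte spans". From the README: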

"The span is the starting and ending byte of the annotation in 
the document.  For example, the annotation listed above is from 
the document, temp_fbis/20.20.10-3414.  The span of this annotation 
is 730,740.  This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation."

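So, the question: how do I index the start and end bytes in a document so that I can match each annotation to its text in the original document? Any ideas? I'm doing this in Python...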

score 0 · Accepted Answer

"This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation.

     blah, blah, blah, example annotation, blah, blah, blah
                       |                 |
                  start byte          end byte

The data_type of all annotations should be 'string'."
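The README describes a half-open range: the start byte is included and the end byte is the one just past the annotation. A minimal sketch of that convention, using Python's bytes slicing on a made-up sample string (the sample text and the helper name `slice_span` are assumptions for illustration, not part of the corpus):

```python
# Hypothetical sample text mirroring the README illustration.
text = b"blah, blah, blah, example annotation, blah, blah, blah"


def slice_span(data: bytes, start: int, end: int) -> bytes:
    """Return the bytes in the half-open range [start, end)."""
    return data[start:end]


# Locate the annotation so the offsets are computed rather than hard-coded.
start = text.index(b"example annotation")
end = start + len(b"example annotation")

annotation = slice_span(text, start, end)
print(start, end, annotation)
```

Because Python slices already exclude the end index, a span such as 730,740 maps directly to `data[730:740]` with no off-by-one adjustment.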

score 0 · Accepted Answer

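# Open the file in binary mode, seek to the start byte, then read up to the end byte.
start, end = 730, 740
with open("myfile", "rb") as f:
    f.seek(start)
    # The end byte is exclusive, so stop once start reaches it.
    while start < end:
        byte = f.read(1)
        # Do stuff with byte.
        start += 1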
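Reading one byte at a time works, but the whole span can also be fetched in a single call by reading `end - start` bytes after the seek. The sketch below is one possible way to do that; the helper name `read_span`, the temporary demo file, and the UTF-8 encoding are assumptions, since the question does not say how the corpus is encoded.

```python
import os
import tempfile


def read_span(path, start, end, encoding="utf-8"):
    """Read bytes [start, end) from path and decode them to text.

    The encoding is an assumption; use whatever the corpus actually uses.
    """
    # Binary mode keeps offsets in bytes, not decoded characters.
    with open(path, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    return raw.decode(encoding)


# Demo on a throwaway file (hypothetical content, not the real corpus).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"blah, blah, blah, example annotation, blah, blah, blah")
    demo_path = tmp.name

try:
    span_text = read_span(demo_path, 18, 36)
finally:
    os.remove(demo_path)

print(span_text)
```

Opening the file in binary mode matters when the text contains multi-byte characters: byte offsets and character offsets then differ, so slicing an already decoded string with the span values could select the wrong text.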
