
I have created a Python script that automates a workflow converting PDF files to .txt files. I want to be able to store and query these files in MongoDB. Do I need to turn the .txt file into JSON/BSON? Should I be using a library like PyMongo?

I am just not sure what the steps of such a project would be, let alone which tools would help with it.

I've looked at this post, "How can one add text files in Mongodb?", which makes me think I need to convert the file to a JSON file and possibly integrate GridFS.


2 Answers


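If you use a driver, you don't need to JSON/BSON-encode the text yourself. If you use the MongoDB shell, you do need to worry about encoding when you paste the contents in.

You'll probably want to use the Python MongoDB driver (PyMongo):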

from pymongo import MongoClient

client = MongoClient()  # connects to localhost:27017 by default
db = client.test_database  # use a database called "test_database"
collection = db.files  # and inside that DB, a collection called "files"

# read the entire contents; the file should be UTF-8 text
with open('test_file_name.txt', encoding='utf-8') as f:
    text = f.read()

# build a document to be inserted
text_file_doc = {"file_name": "test_file_name.txt", "contents": text}
# insert the document into the "files" collection
collection.insert_one(text_file_doc)

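(Untested code.)

If you make sure the file names are unique, you can use the file name as the document's _id property and retrieve the document like this: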

text_file_doc = collection.find_one({"_id": "test_file_name.txt"})
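
For that lookup to work, the document has to be stored with the file name as its _id in the first place. A minimal sketch of that step follows, assuming file names really are unique. The helper names `build_doc_with_id` and `store_text_file` are made up for illustration, and `collection` is expected to be a PyMongo collection such as `MongoClient().test_database.files`.

```python
from pathlib import Path


def build_doc_with_id(path):
    """Build a document whose _id is the file name (assumed unique)."""
    p = Path(path)
    return {"_id": p.name, "contents": p.read_text(encoding="utf-8")}


def store_text_file(collection, path):
    """Insert or overwrite the document for this file and return its _id."""
    doc = build_doc_with_id(path)
    # upsert=True inserts the document if it is missing and replaces it
    # otherwise, so re-running the script does not raise DuplicateKeyError.
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    return doc["_id"]
```

Using `replace_one` with `upsert=True` rather than `insert_one` is a design choice for a conversion script that may run more than once on the same files; if overwriting is not wanted, `insert_one` would fail loudly on a duplicate instead.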

Alternatively, you can make sure the file_name property shown above is indexed and do this:

text_file_doc = collection.find_one({"file_name": "test_file_name.txt"})
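
Creating that index could look roughly like the sketch below. The function names are hypothetical, `collection` is assumed to be a PyMongo collection, and making the index unique is an extra assumption on top of the answer (it only asks for the field to be indexed).

```python
def ensure_file_name_index(collection):
    """Create a unique index on file_name and return the index name.

    Calling this again when an identical index already exists does nothing.
    """
    # unique=True is an assumption: it rejects two documents that share
    # the same file_name. Drop it if duplicates are acceptable.
    return collection.create_index("file_name", unique=True)


def find_by_file_name(collection, file_name):
    """Return the matching document, or None if nothing matches."""
    return collection.find_one({"file_name": file_name})
```

Calling `ensure_file_name_index(collection)` once at startup, before any inserts, should be enough; queries through `find_by_file_name` can then use the index instead of scanning the collection.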

Your other option is to use GridFS, although it's generally not recommended for small files.

There is a getting-started guide for Python and GridFS here.
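
For completeness, a rough sketch of the GridFS route is shown here. It is mainly of interest when a file could exceed MongoDB's 16 MB per-document limit. The two helper names are invented for this example, and `fs` is assumed to be a `gridfs.GridFS` instance.

```python
def save_text_to_gridfs(fs, file_name, text):
    """Store text as UTF-8 bytes in GridFS and return the new file's id."""
    return fs.put(text.encode("utf-8"), filename=file_name)


def load_text_from_gridfs(fs, file_id):
    """Read a stored file back from GridFS and decode it to str."""
    return fs.get(file_id).read().decode("utf-8")


# Possible usage against a local server (assumes pymongo is installed):
#   import gridfs
#   from pymongo import MongoClient
#   fs = gridfs.GridFS(MongoClient().test_database)
#   file_id = save_text_to_gridfs(fs, "test_file_name.txt", "some text")
#   print(load_text_from_gridfs(fs, file_id))
```

Note that GridFS stores raw bytes, so the text is not directly queryable the way a "contents" field in a regular document is.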

answered 2013-04-30T20:44:12.840

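Yes, you have to convert the file to JSON. There is a trivial way to do that: use something like {"text": "your text"}. It is easy to extend or update such records later.

Of course, you need to escape any occurrences of " in the text. I suppose you would use the JSON library of your favorite language and/or the MongoDB library to do all the formatting.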
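
As an illustration of letting a library handle the escaping, here is a small sketch using Python's standard `json` module. The sample string is made up; the `"text"` key follows the shape suggested above.

```python
import json

# Sample text containing characters that must be escaped in JSON.
text = 'She said "hello"\nand left.'

# json.dumps escapes the double quotes and the newline automatically.
encoded = json.dumps({"text": text})
print(encoded)

# Decoding gives back exactly the original string.
decoded = json.loads(encoded)
print(decoded["text"] == text)
```

Hand-building the JSON string with concatenation would require escaping quotes, backslashes, and control characters manually, which is easy to get wrong.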

answered 2013-04-30T19:50:47.770