ruby-on-rails - What is an appropriate string to denote illegible data in a digital humanities transcription?

Question

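I'm building a digital humanities application in which we have a collection of digitized historical documents that students will transcribe. Here is the schema...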

  create_table "documents", force: true do |t|
    t.string   "document_name"
    t.date     "date_filed"
    t.string   "grantor"
    t.string   "grantee"
    t.string   "description"
    t.string   "document_file_name"
    t.string   "document_content_type"
    t.integer  "document_file_size"
  end

  create_table "transcriptions", force: true do |t|
    t.text     "content"
    t.integer  "user_id"
    t.integer  "document_id"
  end

  create_table "users", force: true do |t|
    t.string   "email"
    t.string   "password_digest"
    t.string   "role"
  end

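The application is quite simple. I'm using Paperclip to store the images on S3, and students will create a "transcription", which is just a text field. We will then make the text searchable.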

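These are old documents with a lot of illegible text. I'd like users to be able to mark illegible words in some way, and I'd like to be able to identify those words programmatically later. One use case: when someone other than the original transcriber is reviewing a transcription, they might be able to suggest a reading for (or edit) the illegible word.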

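For example, a user might see the sentence "See Jack Rzn" in the document/image. If they can't make out the word, they might type "See Jack ---" in the text area. Or, if they think they know what the word is but aren't sure, they could type something like "See Jack -!run!-". Later I could search for instances of --- or -!*!- to identify the illegible text.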
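To make "search for instances later" concrete, here is a rough sketch in plain Ruby. The constant `ILLEGIBLE` and the method `illegible_spans` are hypothetical names made up for illustration, and the sketch assumes the two markers above are the only conventions in use:

```ruby
# Hypothetical sketch: find the two markers proposed above in a transcription.
#   ---        word could not be read at all
#   -!guess!-  word is a guess the transcriber is unsure about
ILLEGIBLE = /-!(.+?)!-|---/

# Returns one hash per marker: its character offset, the marker text,
# and the transcriber's guess (nil for a bare ---).
def illegible_spans(content)
  spans = []
  pos = 0
  while (m = ILLEGIBLE.match(content, pos))
    spans << { offset: m.begin(0), marker: m[0], guess: m[1] }
    pos = m.end(0) # continue scanning after this marker
  end
  spans
end

illegible_spans("See Jack --- and See Jack -!run!-").each do |span|
  puts span.inspect
end
```

One thing this makes obvious: a run of three hyphens that the original author actually wrote would be flagged as illegible too, which is exactly the kind of grief I'd like to avoid.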

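I'm just spitballing here, but I'm wondering whether there are particular characters that would cause me less grief later when I go to do "other things" with these transcriptions.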

score 1 · Accepted Answer

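After some research this week, here are some thoughts.

First, the Smithsonian has a crowdsourced digitization project, and they recommend the following guideline: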

If you find a word you can’t quite read

Please make a note using double brackets [[ ]] like this: [[good guess?]] or simply [[?]]. Save your work and you can continue transcribing the rest of the item.

...More info: https://transcription.si.edu/instructions
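If you adopt the double-bracket convention, pulling the uncertain readings back out of the `content` column is a small regex job. The following is a minimal sketch in plain Ruby, not anything provided by the Smithsonian project; `UNCERTAIN` and `uncertain_readings` are hypothetical names, and it assumes brackets are never nested:

```ruby
# Hypothetical helper: matches [[?]] or [[some guess?]] and captures
# whatever sits between the double brackets.
UNCERTAIN = /\[\[(.*?)\]\]/

# Returns one hash per bracketed note: character offset, the raw marker,
# and the guess with its trailing "?" removed (nil when there is no guess).
def uncertain_readings(content)
  readings = []
  pos = 0
  while (m = UNCERTAIN.match(content, pos))
    guess = m[1].strip.sub(/\?\z/, "").strip
    readings << {
      offset: m.begin(0),
      raw: m[0],
      guess: (guess.empty? ? nil : guess)
    }
    pos = m.end(0) # resume after the closing brackets
  end
  readings
end

uncertain_readings("See Jack [[run?]] up the [[?]] hill").each do |reading|
  puts reading.inspect
end
```

The offsets would let a reviewer's suggestion be spliced back into the text at the right place, and double brackets are much less likely than a run of hyphens to collide with punctuation in the original document.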

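Second, there are a couple of "off-the-shelf" options. http://scripto.org/omeka/ is based on the Omeka DH tool.

For Rails people, there is fromthepage, https://github.com/benwbrum/fromthepage. It is a wiki-style application that lets transcribers collaborate on handwritten documents.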
