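For a few days now I've been manually converting articles to Markdown syntax, and it's getting rather tedious. Some of them are 3 or 4 pages long, with italics and other emphasized text. Is there a faster way to convert (.rtf|.doc) files into clean Markdown syntax that I can take advantage of?
7 Answers
If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to html, and pandoc does a good job of converting the resulting html to markdown: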
$ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to markdown.
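That script is not reproduced here. As a rough sketch of the same idea applied to several files at once, the one-liner above can be wrapped in a few shell functions. This assumes macOS with textutil and pandoc on the PATH; the function names md_name, to_markdown, and convert_all are made up for illustration and are not part of either tool.

```shell
#!/bin/sh
# Hypothetical helper: derive the .md output path from an input path
# by swapping the last extension for ".md".
md_name() {
    printf '%s.md\n' "${1%.*}"
}

# Convert one file: textutil writes HTML to stdout, pandoc turns it into Markdown.
to_markdown() {
    textutil -convert html "$1" -stdout |
        pandoc -f html -t markdown -o "$(md_name "$1")"
}

# Convert every file passed as an argument, reporting failures on stderr.
convert_all() {
    for f in "$@"; do
        to_markdown "$f" || echo "failed: $f" >&2
    done
}
```

After sourcing the file, something like `convert_all *.doc *.rtf` would write one .md file next to each input.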
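ProgTips has a possible solution using a Word macro (source download):
A simple macro (source download) for converting the most trivial things automatically. This macro does:
- Replace bold and italics
- Replace headings (marked as Heading 1-6)
- Replace numbered and bulleted lists
It's very buggy, and I believe it hangs on larger documents, but I'm NOT claiming it's a stable release anyway! :-) Experimental use only; recode and reuse it as you like, and post a comment if you've found a better solution.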
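Source: ProgTips
Macro source
Installation
- open WinWord,
- press Alt+F11 to open the VBA editor,
- right-click the first project in the project browser
- choose Insert->Module
- paste the code from the file
- close the macro editor
- go to Tools>Macro>Macros; run the macro named MarkDown
Source: ProgTips
Source
The macro source, kept here for safekeeping in case ProgTips deletes the post or the site gets wiped: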
'*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
'*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
'*** the most simple things. These are:
'*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
'*** 2) Converts tables to text. In fact, tables get lost.
'*** 3) Adds a single indent to all indented paragraphs
'*** 4) Replaces all the text in italics to _text_
'*** 5) Replaces all the text in bold to **text**
'*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
'*** 7) Replaces bulleted lists with ^p * listitem ^p* listitem2...
'*** 8) Replaces numbered lists with ^p 1. listitem ^p2. listitem2...
'*** Feel free to use and redistribute this code
Sub MarkDown()
Dim bReplace As Boolean
Dim i As Integer
Dim oPara As Paragraph
'remove formatting from paragraph marks so that we don't get **blablabla^p** but rather **blablabla**^p
Call RemoveBoldEnters
For i = Selection.Document.Tables.Count To 1 Step -1
Call Selection.Document.Tables(i).ConvertToText
Next
'simple text indent + extra paragraphs for non-numbered paragraphs
For i = Selection.Document.Paragraphs.Count To 1 Step -1
Set oPara = Selection.Document.Paragraphs(i)
If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
If oPara.LeftIndent > 0 Then
oPara.Range.InsertBefore (">")
End If
oPara.Range.InsertBefore (vbCrLf)
End If
Next
'italic -> _italic_
Selection.HomeKey Unit:=wdStory
bReplace = ReplaceOneItalic 'first replacement
While bReplace 'other replacements
bReplace = ReplaceOneItalic
Wend
'bold-> **bold**
Selection.HomeKey Unit:=wdStory
bReplace = ReplaceOneBold 'first replacement
While bReplace
bReplace = ReplaceOneBold 'other replacements
Wend
'Heading -> ##heading
For i = 1 To 6 'heading1 to heading6
Selection.HomeKey Unit:=wdStory
bReplace = ReplaceH(i) 'first replacement
While bReplace
bReplace = ReplaceH(i) 'other replacements
Wend
Next
Call ReplaceLists
Selection.HomeKey Unit:=wdStory
End Sub
'***************************************************************
' Function to wrap bold runs in **bold** and clear the bold formatting
' Returns True if any occurrence was found, False otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Function ReplaceOneBold() As Boolean
Dim bReturn As Boolean
Selection.Find.ClearFormatting
With Selection.Find
.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Font.Bold = True
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
bReturn = False
While Selection.Find.Execute = True
bReturn = True
Selection.Text = "**" & Selection.Text & "**"
Selection.Font.Bold = False
Selection.Find.Execute
Wend
ReplaceOneBold = bReturn
End Function
'*******************************************************************
' Function to wrap italic runs in _italic_ and clear the italic formatting
' Returns True if any occurrence was found, False otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'********************************************************************
Function ReplaceOneItalic() As Boolean
Dim bReturn As Boolean
Selection.Find.ClearFormatting
With Selection.Find
.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Font.Italic = True
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
bReturn = False
While Selection.Find.Execute = True
bReturn = True
Selection.Text = "_" & Selection.Text & "_"
Selection.Font.Italic = False
Selection.Find.Execute
Wend
ReplaceOneItalic = bReturn
End Function
'*********************************************************************
' Function to prefix "Heading X" paragraphs with X "#" signs and reset them to Normal
' Returns True if any occurrence was found, False otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'*********************************************************************
Function ReplaceH(ByVal ipNumber As Integer) As Boolean
Dim sReplacement As String
Dim bReturn As Boolean 'was undeclared; fails to compile under Option Explicit
Select Case ipNumber
Case 1: sReplacement = "#"
Case 2: sReplacement = "##"
Case 3: sReplacement = "###"
Case 4: sReplacement = "####"
Case 5: sReplacement = "#####"
Case 6: sReplacement = "######"
End Select
Selection.Find.ClearFormatting
Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
With Selection.Find
.Text = ""
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
bReturn = False
While Selection.Find.Execute = True
bReturn = True
Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
Selection.Style = ActiveDocument.Styles("Normal")
Selection.Find.Execute
Wend
ReplaceH = bReturn
End Function
'***************************************************************
' A fix-up for paragraph marks that are bold or italic
'***************************************************************
Sub RemoveBoldEnters()
Selection.HomeKey Unit:=wdStory
Selection.Find.ClearFormatting
Selection.Find.Font.Italic = True
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.Font.Bold = False
Selection.Find.Replacement.Font.Italic = False
With Selection.Find
.Text = "^p"
.Replacement.Text = "^p"
.Forward = True
.Wrap = wdFindContinue
.Format = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
Selection.HomeKey Unit:=wdStory
Selection.Find.ClearFormatting
Selection.Find.Font.Bold = True
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.Font.Bold = False
Selection.Find.Replacement.Font.Italic = False
With Selection.Find
.Text = "^p"
.Replacement.Text = "^p"
.Forward = True
.Wrap = wdFindContinue
.Format = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub
'***************************************************************
' Prefixes each list paragraph with a Markdown marker ("* " for bullets,
' "N. " for numbered items), then removes Word's own list formatting
'***************************************************************
Sub ReplaceLists()
Dim i As Integer
Dim j As Integer
Dim Para As Paragraph
Selection.HomeKey Unit:=wdStory
'iterate through all the lists in the document
For i = Selection.Document.Lists.Count To 1 Step -1
'check each paragraph in the list
For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
Set Para = Selection.Document.Lists(i).ListParagraphs(j)
'if it's a bulleted list
If Para.Range.ListFormat.ListType = wdListBullet Then
Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
'if it's a numbered list
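'compare ListType against each constant; "x = A Or B Or C" is always True in VBA
ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
Para.Range.ListFormat.ListType = wdListMixedNumbering Or _
Para.Range.ListFormat.ListType = wdListListNumOnly Then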
Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ". ")
End If
Next j
'inserts paragraph marks before and after, removes the list itself
Selection.Document.Lists(i).Range.InsertParagraphBefore
Selection.Document.Lists(i).Range.InsertParagraphAfter
Selection.Document.Lists(i).RemoveNumbers
Next i
End Sub
'***********************************************************
' Returns the MarkDown indent text
'***********************************************************
Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
Dim i As Integer
For i = 1 To ipNumber - 1
ListIndent = ListIndent & " "
Next
ListIndent = ListIndent & spChar & " "
End Function
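Source: ProgTips
If you're open to using the .docx format, you could use this PHP script I put together, which extracts the XML, runs some XSL transformations, and outputs a fairly decent Markdown equivalent: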
https://github.com/matb33/docx2md
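Note that it's meant to work from the command line and is rather basic in its interface. However, it will get the job done!
If the script doesn't work well enough for you, I encourage you to send me your .docx file so I can reproduce your problem and fix it. Log an issue on GitHub or contact me directly if you prefer.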
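Pandoc is a good command-line conversion tool, but again, you'll first need to get the input into a format Pandoc can read, namely:
- markdown
- reStructuredText
- textile
- HTML
- LaTeX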
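Newer Pandoc releases also include a docx reader, so for .docx input the intermediate conversion may not be needed at all. A minimal sketch, assuming a Pandoc version with docx support is installed; the wrapper name docx_to_md and the media directory argument are hypothetical:

```shell
#!/bin/sh
# Hypothetical wrapper: convert a .docx straight to Markdown using pandoc's docx reader.
# $1 = input .docx, $2 = directory where embedded images are extracted.
docx_to_md() {
    pandoc -f docx -t markdown --extract-media="$2" -o "${1%.docx}.md" "$1"
}
```

For example, `docx_to_md report.docx media` would write report.md and place any embedded images under media. Older .doc files would still need a first pass through something like textutil or Word itself.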
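We had the same problem of having to convert Word documents to markdown. Some were more complicated and (very) large documents, with math equations, images, and so on. So I made this script, which chains a number of different tools to do the conversion: https://github.com/Versal/word2markdown
Because it uses a chain of tools it's a bit more error-prone, but it can be a good starting point if you have more complicated documents. Hope it helps! :)
Update: It currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly you also need to open an empty Word document and do: File->Save as Webpage->Compatibility->Encoding->UTF-8. Then save this encoding as the default. See the readme for more details on how to set it up.
Then run this in the console: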
$ git clone git@github.com:Versal/word2markdown.git
$ cd word2markdown
$ npm install
(copy over the Word files, for example, "document.docx")
$ ./doc-to-md.sh document.docx document_files > document.md
Then you can find the Markdown in document.md and the images in the directory document_files.
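If there are several Word files to process, the same call can be repeated in a loop. This is only a sketch built on the invocation shown above (input file, image directory, Markdown on stdout); the function name convert_docs is hypothetical, and it assumes it is run from inside the word2markdown checkout.

```shell
#!/bin/sh
# Hypothetical loop around the doc-to-md.sh call shown above.
# For each input X.docx it writes X.md and passes X_files as the image directory.
convert_docs() {
    for doc in "$@"; do
        base="${doc%.docx}"
        ./doc-to-md.sh "$doc" "${base}_files" > "${base}.md"
    done
}
```

Calling `convert_docs chapter1.docx chapter2.docx` would then produce chapter1.md and chapter2.md.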
It's a bit involved right now, so I'd welcome any contributions that make this easier or get it working on other operating systems! :)
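Have you tried this one? Not sure how feature-rich it is, but it works for simple text. http://markitdown.medusis.com/
As part of a university Ruby course I developed a tool that can convert OpenOffice Writer files (.odt) to markdown. A lot of assumptions have to be made to convert to the correct format. For example, it's hard to determine what text size should be treated as a heading. However, the only thing you can lose with this conversion is formatting; any text encountered is always appended to the markdown document. The tool I developed supports lists, bold and italic text, and it has syntax for tables.
http://github.com/bostko/doc2text Give it a try and please send me your feedback.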