
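I am building a program that scans a book's title page and uses OCR to extract the publisher. Because the publisher always appears at the bottom of the title page, I think detecting groups of lines separated by blank lines would be a solution, but I don't know how to test for that. Here is my code: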

Dim builder As New StringBuilder()
Dim reader As New StringReader(txtOCR.Text)
Dim iCounter As Integer = 0
While True
    Dim line As String = reader.ReadLine()
    If line Is Nothing Then Exit While

    'i want to put the condition here

End While
txtPublisher.Text = builder.ToString()

3 Answers


Do you mean blank lines? Then you can do this:

Dim bEmpty As Boolean

Then, inside the loop:

If line.Trim().Length = 0 Then
    bEmpty = True
Else
    If bEmpty Then
        '...
    End If

    bEmpty = False
End If
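
Put together with the loop from the question, the flag idea could look roughly like the sketch below. The module name `BlankLineSketch` and the function `LastBlock` are hypothetical names introduced for illustration, and the sketch assumes that "the publisher" is the last group of non-blank lines in the OCR text.

```vbnet
Imports System.IO
Imports System.Text

Module BlankLineSketch
    ' Hypothetical helper: returns the last block of consecutive
    ' non-blank lines in ocrText, joined with Environment.NewLine.
    Function LastBlock(ocrText As String) As String
        Dim builder As New StringBuilder()
        Dim bEmpty As Boolean = False

        Using reader As New StringReader(ocrText)
            While True
                Dim line As String = reader.ReadLine()
                If line Is Nothing Then Exit While

                If line.Trim().Length = 0 Then
                    bEmpty = True
                Else
                    ' A non-blank line right after a blank one starts a new block,
                    ' so discard whatever was collected for the previous block.
                    If bEmpty Then builder.Clear()
                    If builder.Length > 0 Then builder.AppendLine()
                    builder.Append(line.Trim())
                    bEmpty = False
                End If
            End While
        End Using

        Return builder.ToString()
    End Function
End Module
```

In the form, this would be used as `txtPublisher.Text = BlankLineSketch.LastBlock(txtOCR.Text)`. Trailing blank lines are harmless here, because the builder is only cleared once another non-blank line appears.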
answered 2013-02-26T09:48:51.970

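Why not do the following: walk up from the bottom until you find the first non-empty line (I don't know how the OCR works; perhaps the bottom-most line is always non-empty, in which case this step is redundant). Then keep walking up until you reach the first empty line. The text in between is the publisher.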

You don't need a StringReader for this:

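' Split on both CRLF and LF so either line-ending style works.
Dim lines As String() = txtOCR.Text.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.None)
Dim bottom As Integer = lines.Length - 1

' Find bottom-most non-empty line.
Do While bottom >= 0 AndAlso String.IsNullOrWhiteSpace(lines(bottom))
    bottom -= 1
Loop

Dim publisher As String = String.Empty

' Only continue if the text contains at least one non-empty line.
If bottom >= 0 Then
    ' Find empty line above that (top ends at -1 if there is none).
    Dim top As Integer = bottom - 1

    Do Until top < 0 OrElse String.IsNullOrWhiteSpace(lines(top))
        top -= 1
    Loop

    ' Copy the lines below top, up to and including bottom.
    Dim publisherSubset(bottom - top - 1) As String
    Array.Copy(lines, top + 1, publisherSubset, 0, bottom - top)
    publisher = String.Join(Environment.NewLine, publisherSubset)
End If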

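But honestly, I don't think this is a particularly good approach. It is inflexible and copes poorly with unexpected formatting. I would instead use a regular expression to describe what the publisher string (and its context) looks like. Even that may not be enough, and you may have to consider describing the whole page in order to infer which parts are the publisher.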
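
As a rough illustration of the regular-expression idea, the sketch below picks the last line that contains a publisher-like keyword. The keyword list, the module name `PublisherRegexSketch`, and the function `FindPublisher` are all assumptions made up for this example; a real pattern would have to be tuned against actual OCR output.

```vbnet
Imports System.Text.RegularExpressions

Module PublisherRegexSketch
    ' Assumed, purely illustrative keyword list; adjust to your own data.
    Private ReadOnly PublisherPattern As New Regex(
        "^.*\b(Press|Publishing|Publishers|Books|Verlag)\b.*$",
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)

    ' Hypothetical helper: returns the last line that looks like a publisher,
    ' or Nothing when no line matches the pattern.
    Function FindPublisher(ocrText As String) As String
        Dim matches As MatchCollection = PublisherPattern.Matches(ocrText)
        If matches.Count = 0 Then Return Nothing

        ' Trim removes a trailing carriage return left over from CRLF line endings.
        Return matches(matches.Count - 1).Value.Trim()
    End Function
End Module
```

Unlike the position-based approach, this does not depend on where the blank lines fall, but it will miss any publisher whose name contains none of the assumed keywords.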

answered 2013-02-26T10:05:14.983

Assuming the publisher is always on the last line and always follows a blank line, then perhaps the following?

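    Dim lines As New List(Of String)
    Dim currentLine As String = ""
    Dim previousLine As String = ""

    ' The OCR result is already a string, so read it with a StringReader.
    Using reader As New StringReader(txtOCR.Text)
        currentLine = reader.ReadLine()
        While currentLine IsNot Nothing
            ' Keep only non-blank lines that directly follow a blank line
            ' (the very first line also counts, since previousLine starts empty).
            If String.IsNullOrWhiteSpace(previousLine) AndAlso
               Not String.IsNullOrWhiteSpace(currentLine) Then lines.Add(currentLine)
            previousLine = currentLine
            currentLine = reader.ReadLine()
        End While
    End Using

    txtPublisher.Text = lines.LastOrDefault()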

If it does not matter whether the preceding line is blank/empty, simply take the last line:

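Dim lines As New List(Of String)
Using reader As New StringReader(txtOCR.Text)
    ' Read every line; the last one collected is taken as the publisher.
    Dim line As String = reader.ReadLine()
    While line IsNot Nothing
        lines.Add(line)
        line = reader.ReadLine()
    End While
End Using

txtPublisher.Text = lines.LastOrDefault()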
answered 2013-02-26T10:21:18.607