我想突出显示一组 PDF 文件中的几个关键字。首先,我们必须识别单个单词并将它们与我的关键字匹配。我找到了一个例子:
class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
List<string> topicTerms;
public MyLocationTextExtractionStrategy(List<string> topicTerms)
{
this.topicTerms = topicTerms;
}
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
//filter the meaingless words
string text = renderInfo.GetText();
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
但是,我发现很多单词都坏了。例如,“stop”将是“st”和“op”。还有其他方法可以识别单个单词及其位置吗?