
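I'm doing some basic text analysis in R and want to count the number of lines in a transcript from a .txt file that I load into R. For the example below, I'd like a count in which each speaker gets a new row with their line count attached, so that MR Smith = 4, MR Gordon = 6, MR Catalano = 3.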

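[71] "\"511\"\t\"
MR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\""
[72] "\"513\"\t\"
MR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\""
[73] "\"515\"\t\"
MR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""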

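The function countLine() doesn't work because it requires a connection, and these are just .txt files imported into R. I realize that the line count depends on the format in which the text is opened, but any general guidance on whether this is feasible would help. Thanks.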


2 Answers


Your example wasn't reproducible, so I edited it to contain what you posted, though I don't know whether the names will match:

txtvec <-   structure(list(`'511'   ` = "MR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\"", 
    `'513'  ` = "MR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\"", 
    `'515'  ` = "MR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""), .Names = c("'511'\t", 
"'513'\t", "'515'\t"))

After that, it's just a matter of running a regex over it and tabulating the results:

> table( sapply(txtvec, function(x) sub("(^MR.+)\\:.+", "\\1", x) ) )
#MR Catalano   MR Gordon    MR Smith 
#          1           1           1 

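There was a concern that the names were not in the original structure. Here's another version with an unnamed vector and a slightly modified regex: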

txtvec <-  c("\"511\"\t\"\nMR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\"", 
"\"513\"\t\"\nMR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\"", 
"\"515\"\t\"\nMR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""
)

 table( sapply(txtvec, function(x) sub(".+\\n(MR.+)\\:.+", "\\1", x) ) )

#MR Catalano   MR Gordon    MR Smith 
#          1           1           1 

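To count the number of "lines" each element would occupy on a device that wraps at 80 characters per line, you could use this code (which could easily be turned into a function):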

 sapply(txtvec, function(tt) 1+nchar(tt) %/% 80)
#[1] 5 8 4
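
As a rough sketch of that conversion, the helper below labels each estimate with its speaker. The name count_wrapped_lines, the width argument, and the demo strings are illustrative assumptions rather than anything from the question, and the sketch assumes every element contains an "MR ...:" label.

```r
# Hypothetical helper: estimate how many wrapped lines each element occupies.
# `width` is an assumed display width; the data itself does not define it.
count_wrapped_lines <- function(x, width = 80) {
  # Take the first "MR <name>" run before a colon as the speaker label.
  # Assumes every element has such a label; regmatches() drops non-matches.
  speaker <- regmatches(x, regexpr("MR[^:]+", x))
  # Same arithmetic as above: one line, plus one per full `width` characters.
  lines <- 1 + nchar(x) %/% width
  setNames(lines, speaker)
}

# Made-up demo strings in the same shape as the transcript elements.
demo <- c("\"1\"\t\"\nMR A: short remark.\"",
          paste0("\"2\"\t\"\nMR B: ", strrep("word ", 40), "\""))
res <- count_wrapped_lines(demo)
print(res)
```

If the same speaker appears in several elements, the per-element counts could then be summed by name, for example with tapply(res, names(res), sum).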
answered 2013-03-16T01:04:25.650

This was suggested in the comments, but it really deserves to be its own answer:

You cannot "count lines" without defining what a "line" is. A line is a very vague concept, and it can vary depending on the program being used.

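Unless, of course, the data contains some linebreak marker, such as \n. But even then, you would not be counting lines, you would be counting linebreaks. You would then have to ask yourself whether the hard-coded linebreaks coincide with what you are hoping to analyze.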
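
If the data does carry such markers, counting them might look like the sketch below. The function name count_linebreaks and the sample strings are made up for illustration; only base R functions are used.

```r
# Count hard-coded "\n" characters in each element of a character vector.
count_linebreaks <- function(x) {
  # gregexpr() returns -1 for "no match", so count only positive positions.
  vapply(gregexpr("\n", x, fixed = TRUE),
         function(m) sum(m > 0),
         integer(1))
}

sample_text <- c("one\ntwo\nthree", "no breaks here")
breaks <- count_linebreaks(sample_text)
print(breaks)
```

Note that this gives the number of breaks, not lines: text with two breaks and no trailing newline spans three lines.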

--

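If your data does not contain linebreaks but you still want to count lines, then we are back to the question of how you define a line. The most basic approach, as @flodel suggested, is to use the character length. For example, you can define a line as 76 characters long and then take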

ceiling(nchar(X) / 76)

This of course assumes that you are allowed to split words. (If you need words to stay whole, you will have to get craftier.)
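
One possible way to keep words whole, offered as a sketch rather than the method the answer had in mind, is base R's strwrap(), which breaks text at whitespace. The helper name count_word_wrapped and the sample strings are assumptions; strwrap() also collapses runs of whitespace and keeps lines shorter than width, so its counts can differ slightly from the ceiling() estimate.

```r
# Count lines after wrapping at word boundaries.
# `width` is an assumed setting; strwrap() keeps lines shorter than `width`.
count_word_wrapped <- function(x, width = 76) {
  vapply(x,
         function(s) length(strwrap(s, width = width)),
         integer(1),
         USE.NAMES = FALSE)
}

sample_text <- c("a short line", paste(rep("word", 40), collapse = " "))
wrapped <- count_word_wrapped(sample_text)
print(wrapped)
```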

answered 2013-03-16T02:49:29.507