r - How to read a file with a messy column?

Question

I have a "\t"-separated data file that looks like this:

Hotel       Price   Location
hotel1      100       A
hotel2      Unknown   B
hotel3      1,200     C
hotel4      <id=?h    B

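In the "Price" column, some numbers contain commas, such as "1,200". In some rows the "Price" column is messy: it contains "Unknown" or other content that has no "\t" and follows no particular pattern.

How can I read this file, drop every row whose "Price" is messy, and remove all commas from the numbers? What I want to end up with is the following: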

Hotel       Price   Location
hotel1      100     A
hotel3      1200    C

I tried using

price <- read.table("hotel.txt", header=TRUE, colClasses=c("Price"="integer"))

It does not work, because scan() expects an integer but finds values that are not integers.

Can anyone help?

Thanks in advance.

score 3 · Accepted Answer

In two steps:

## keep only rows whose Price consists solely of digits and commas
dat <- dat[grepl('^[0-9][0-9,]*$',dat$Price),]
# Hotel Price Location
# 1 hotel1   100        A
# 3 hotel3 1,200        C

## convert price to numeric
dat$Price <- as.numeric(gsub(',','',dat$Price))

#    Hotel Price Location
# 1 hotel1   100        A
# 3 hotel3  1200        C

where dat is:

dat <- read.table(text='Hotel   Price   Location
hotel1  100 A
hotel2  Unknown B
hotel3  1,200   C
hotel4  <id=?h  B',header=TRUE)
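
For reading from the actual file rather than from inline text, the same two steps might look like the sketch below. It is an illustration under assumptions, not part of the original answer: the temporary file stands in for hotel.txt, the names path, ok and clean are made up for the example, and valid prices are assumed to consist only of digits and commas. Reading every column as character avoids the scan() error from the question, because nothing is parsed as an integer until the bad rows are gone.

```r
# Write the sample data to a temporary tab-separated file (stand-in for hotel.txt).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Hotel\tPrice\tLocation",
  "hotel1\t100\tA",
  "hotel2\tUnknown\tB",
  "hotel3\t1,200\tC",
  "hotel4\t<id=?h\tB"
), path)

# Read every column as character so malformed prices cannot cause a parse error.
# quote = "" and comment.char = "" keep stray quotes or '#' in junk cells literal.
dat <- read.table(path, header = TRUE, sep = "\t", colClasses = "character",
                  quote = "", comment.char = "")

# Keep rows whose Price is made up of digits and commas only (assumed rule).
ok <- grepl("^[0-9][0-9,]*$", dat$Price)
clean <- dat[ok, ]

# Strip the commas, then convert the remaining strings to integer.
clean$Price <- as.integer(gsub(",", "", clean$Price, fixed = TRUE))
clean
```

If prices can carry decimals or a currency sign, the pattern in grepl() would need to be widened accordingly, and as.numeric() would be the safer conversion.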
