
I have a CSV file that looks something like this:

text
RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN
RT @gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT @sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT @MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via @nucAmbiguous htp://…

I want to extract all the mentions (starting with '@') from the tweet text. So far I have done this:

import pandas as pd
import re

mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'

for i in range(X.shape[0]):
    result = re.findall("(^|[^@\w])@(\w{1,25})", str(X.iloc[:i,:]))

print(result);

There are two problems here. First, str(X.iloc[:1,:]) gives me ['CritCareMed'], which is wrong because it should give me ['CellCellPress'], and str(X.iloc[:2,:]) again gives me ['CritCareMed'], which is also wrong. The final result I'm getting is:

[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]

It doesn't include the mention in the 2nd row or either of the two mentions in the last row. What I want should look something like this:

[image of the desired output; not reproduced here]

How can I achieve these results? This is just sample data; my original data has lots of tweets, so is this approach OK?


3 Answers


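You can use the str.findall method to avoid the for loop, and use a negative lookbehind in place of (^|[^@\w]), which forms an extra capture group that you don't need in your regex: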

df['mention'] = df.text.str.findall(r'(?<![@\w])@(\w{1,25})').apply(','.join)
df
#                                                text   mention
#0  RT @CritCareMed: New Article: Male-Predominant...   CritCareMed
#1  #CRISPR Inversion of CTCF Sites Alters Genome ...   CellCellPress
#2  RT @gvwilson: Where's the theory for software ...   gvwilson
#3  RT @sciencemagazine: What’s killing off the se...   sciencemagazine
#4  RT @MHendr1cks: Eve Marder describes a horror ...   MHendr1cks,nucAmbiguous
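
To make the difference between the two patterns concrete, here is a minimal sketch using only the standard re module. The sample string is made up for illustration (it is not from the CSV), and the e-mail address in it is an assumption added to show what the lookbehind rejects.

```python
import re

# Hypothetical sample text; the e-mail address is added only for illustration.
tweet = "RT @MHendr1cks: Eve Marder describes a horror. via @nucAmbiguous, mail me@example.com"

# Original pattern: two capture groups, so findall returns (prefix, name) tuples.
grouped = re.findall(r"(^|[^@\w])@(\w{1,25})", tweet)

# Negative lookbehind: the preceding character is checked but not captured,
# so findall returns just the names.
lookbehind = re.findall(r"(?<![@\w])@(\w{1,25})", tweet)

print(grouped)
print(lookbehind)
```

Both patterns skip me@example.com because the @ there follows a word character; only the shape of the result differs.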

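X.iloc[:i,:] returns a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the elements in its cells. To extract the actual string from the text column, you can use X.text.iloc[0], or, better, iterate through the column with items (called iteritems before pandas 2.0, where it was removed):

import re
for index, s in df.text.items():
    result = re.findall(r"(?<![@\w])@(\w{1,25})", s)
    print(','.join(result))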

#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
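
As a rough illustration of why the string form of a data frame loses mentions, the sketch below renders a one-row frame and compares it with the cell itself. It assumes pandas is installed and pins display.max_colwidth to 50 (which I believe is the default) so the result does not depend on local settings; the variable names are mine.

```python
import re
import pandas as pd

pattern = re.compile(r"(?<![@\w])@(\w{1,25})")

# One-row frame holding the second tweet from the question.
df = pd.DataFrame({"text": [
    "#CRISPR Inversion of CTCF Sites Alters Genome Topology & "
    "Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN"
]})

# Pin the column width so the rendering is the same everywhere.
with pd.option_context("display.max_colwidth", 50):
    rendered = str(df.iloc[:1, :])  # printable table; long cells are cut short

cell = df["text"].iloc[0]           # the actual string stored in the cell

from_rendered = pattern.findall(rendered)
from_cell = pattern.findall(cell)
print(from_rendered)
print(from_cell)
```

The mention sits near the end of the tweet, so it is cut from the rendered table and only the search on the cell finds it.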
Answered 2017-10-08T17:18:54.803


Although you already have an answer, you could even try to optimize the whole import process, like so:

import re, pandas as pd

rx = re.compile(r'@([^:\s]+)')

with open("test.txt") as fp:
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())

    df = pd.DataFrame(dft, columns = ['text', 'mention'])
    print(df)


This yields:

                                                text                  mention
0  RT @CritCareMed: New Article: Male-Predominant...              CritCareMed
1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
2  RT @gvwilson: Where's the theory for software ...                 gvwilson
3  RT @sciencemagazine: What’s killing off the se...          sciencemagazine
4  RT @MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous

This might be a bit faster, as you don't need to change the df once it has been constructed.
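
One thing to keep in mind is that @([^:\s]+) is more permissive than the \w-based pattern from the other answer. The sketch below compares the two on a made-up line (the input is an assumption, not part of the sample data).

```python
import re

permissive = re.compile(r"@([^:\s]+)")            # pattern used in this answer
strict = re.compile(r"(?<![@\w])@(\w{1,25})")     # lookbehind pattern, for comparison

# Hypothetical input with an e-mail address and a mention followed by a comma.
line = "contact me@example.com or ping @user, thanks"

permissive_hits = permissive.findall(line)
strict_hits = strict.findall(line)

print(permissive_hits)
print(strict_hits)
```

The permissive pattern picks up the e-mail domain and keeps the trailing comma, so it is a reasonable choice only if the tweets are known not to contain such cases.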

Answered 2017-10-10T05:54:27.507

mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')

Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.

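  • @.*? carries out a non-greedy match for a word starting with @
  • (?=\s|$) is a lookahead for the end of the word or the end of the sentence
  • (?:(?<=\s)|(?<=^)) is a lookbehind that prevents false positives when @ is used in the middle of a word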

The regex lookbehind asserts that either whitespace or the start of the sentence must precede the @ character.
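
Note that this pattern has no capture group, so each match includes the @ itself and any punctuation up to the next whitespace. A small sketch with the standard re module, using a shortened version of the last sample tweet; the cleanup step at the end is my own addition, not part of the answer:

```python
import re

pattern = re.compile(r"(?:(?<=\s)|(?<=^))@.*?(?=\s|$)")

# Shortened version of the last sample tweet.
tweet = "RT @MHendr1cks: Eve Marder describes a horror. via @nucAmbiguous"

# Full matches: the leading @ and trailing punctuation are kept.
raw = pattern.findall(tweet)

# Optional cleanup (an assumption about the desired output):
# drop the leading @ and any trailing non-word characters.
cleaned = [re.sub(r"^@|\W+$", "", m) for m in raw]

print(raw)
print(cleaned)
```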

Answered 2019-06-11T07:09:26.317