I would like to know how to get the exact frequency of trigrams. I think the function I am currently using measures something more like "importance": it resembles a frequency, but it is not the same thing.
To be clear, a trigram here is a sequence of 3 consecutive words. Punctuation should not affect trigrams, or at least I do not want it to.
My definition of frequency is: the number of comments that contain the trigram at least once.
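To make the distinction concrete, here is a tiny made-up example (not my real data): a trigram occurring twice in one comment counts as 2 occurrences in total, but as only 1 comment under my definition.

docs = [
    "très bon état très bon état",  # made-up comment: the trigram occurs twice
    "produit en mauvais état",      # made-up comment: the trigram is absent
]
total_occurrences = sum(doc.count("très bon état") for doc in docs)  # -> 2
comments_containing = sum("très bon état" in doc for doc in docs)    # -> 1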
Here is how I built my database by web scraping:
import re
import json
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

root_url = 'https://fr.trustpilot.com/review/www.gammvert.fr'
urls = ['{root}?page={i}'.format(root=root_url, i=i) for i in range(1, 807)]

comms = []
notes = []
dates = []

for url in urls:
    results = requests.get(url)
    time.sleep(20)  # be polite with the server between requests
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all('section', class_='review__content')
    for container in commentary:
        try:
            comm = container.find('p', class_='review-content__text').text.strip()
        except AttributeError:
            # some reviews only expose their text through the title link
            comm = container.find('a', class_='link link--large link--dark').text.strip()
        comms.append(comm)
        note = container.find('div', class_='star-rating star-rating--medium').find('img')['alt']
        notes.append(note)
        date_tag = container.div.div.find("div", class_="review-content-header__dates")
        date = json.loads(re.search(r"({.*})", str(date_tag)).group(1))["publishedDate"]
        dates.append(date)

data = pd.DataFrame({
    'comms': comms,
    'notes': notes,
    'dates': dates
})

data['comms'] = data['comms'].str.replace('\n', '')
data['dates'] = pd.to_datetime(data['dates']).dt.normalize()  # keep the day, drop the time
data.to_csv('file.csv', sep=';', index=False)
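(If the analysis below is run in a later session, the frame can be reloaded from that CSV instead of re-scraping; a small sketch, using the same file name and separator as above:)

data = pd.read_csv('file.csv', sep=';')
data['dates'] = pd.to_datetime(data['dates'])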
Here is the function I use to obtain comms_clean:
def clean_text(text):
    text = tokenizer.tokenize(text)
    text = nltk.pos_tag(text)
    # keep only nouns
    text = [word for word, pos in text if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
    text = [word for word in text if word not in stop_words]
    text = [word for word in text if len(word) > 2]  # remove very short words
    return ' '.join(text)
data['comms_clean'] = data['comms'].apply(clean_text)
data['month'] = data.dates.dt.strftime('%Y-%m')
Here are a few rows of my database:
And here is the function I use to get the trigram frequencies in my database:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_gram(corpus, ngram_range, n=None):
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)  # total occurrences of each n-gram over the whole corpus
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

def process(corpus):
    corpus = pd.DataFrame(corpus, columns=['Text', 'count']).sort_values('count', ascending=True)
    return corpus
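As far as I understand, those column sums count every occurrence of a trigram across the whole corpus. If I read the scikit-learn documentation correctly, passing binary=True to CountVectorizer caps each comment's contribution at 1 per trigram, so the sums would become the number of comments containing the trigram, which is my definition. A sketch of that variant (untested on my data):

def get_top_n_gram_binary(corpus, ngram_range, n=None):
    # binary=True: each comment contributes at most 1 per trigram, so the
    # sums are "number of comments containing the trigram at least once"
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words,
                          binary=True).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]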
Here is the result of these lines:
trigram = get_top_n_gram(data['comms_clean'], (3,3), 10)
trigram = process(trigram)
trigram.sort_values('count', ascending=False, inplace=True)
trigram.head(10)
Let me show you where it seems inconsistent, even if the differences are small. I will start with the 6th trigram of my picture above:
df = data[data['comms_clean'].str.contains('très bon état',regex=False, case=False, na=False)]
df.shape
(150, 5)
df = data[data['comms_clean'].str.contains('rapport qualité prix',regex=False, case=False, na=False)]
df.shape
(148, 5)
df = data[data['comms_clean'].str.contains('très bien passé',regex=False, case=False, na=False)]
df.shape
(129, 5)
So with my function I get:
146
143
114
whereas when I count the comments that contain each of those trigrams, I get:
150
148
129
Not far off, but I would rather have the exact numbers. So I am wondering: how can I get the exact frequency of a trigram, rather than some kind of importance score? The importance is fine, don't get me wrong, but I would also like to know the correct count.
I tried this:
from collections import Counter
from nltk.util import ngrams

for i in range(1, 16120):
    # builds a Counter per comment, but each one is discarded immediately
    Counter(ngrams(data['comms_clean'][i].split(), 3))
but I could not find how to combine all the counters produced inside the loop.
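Roughly what I imagine the combination could look like (an untested sketch; doc_freq counts each comment at most once, which matches my definition, while total counts every occurrence):

total = Counter()     # total occurrences of each trigram
doc_freq = Counter()  # number of comments containing each trigram
for comm in data['comms_clean']:
    tris = list(ngrams(comm.split(), 3))
    total.update(tris)
    doc_freq.update(set(tris))  # at most once per comment
doc_freq.most_common(10)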
Thank you.
EDIT:
import nltk
from nltk.corpus import stopwords
from spacy.lang.fr import French

stop_words = set(stopwords.words('french'))
stop_words.update(("Gamm", "gamm"))
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
lemmatizer = French.Defaults.create_lemmatizer()