I am relatively new to Google Cloud Platform. I have a large dataset (18 million articles) and need to run entity-sentiment analysis on it using GCP's NLP API. I am not sure that the way I have been conducting my analysis is optimal in terms of the time it takes to get the entity sentiment for all the articles, and I wonder whether there is a way to batch-process the articles instead of iterating through each one and making a separate API call. Here is a summary of the process I have been using:
- I have about 500 files, each of which contains about 30,000 articles.
- Using a Python script on my local server, I iterate through each file and, for each article, call the function given here.
- I store the entire API output for each article as a serialized protobuf.
After this step I no longer need the Google API; I perform my final analysis on the API output stored in the protobufs.
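For reference, this is roughly how I read a stored result back for the final analysis. It is only a minimal sketch: the response class and the FromString call assume the same google-cloud-language version I use for the API calls, and the file path is just an example.

from google.cloud.language import types

# Hypothetical path: folder "0" for the first article, article id 12345.
with open("0/12345.bin", "rb") as f:
    result = types.AnalyzeEntitySentimentResponse.FromString(f.read())

# Each entity carries its own sentiment score and magnitude.
for entity in result.entities:
    print("{} {} {}".format(
        entity.name, entity.sentiment.score, entity.sentiment.magnitude))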
This approach worked well enough for an earlier research project with about 1.5 million articles, where it took a few days. Now that I have 18 million articles, I wonder if there is a better way to go about it. The articles I have read about batch processing are geared towards building an app or towards image-processing tasks. There was something like what I want here, but I am not sure whether I can do the same with the NLP API.
Here is a snippet of my code; DF is a pandas DataFrame that holds my articles.
import json
import os
import sys
import time

import six
from google.cloud import language
from google.cloud.language import enums, types

client = language.LanguageServiceClient()

def entity_sentiment_text(text):
    """Detects entity sentiment in the provided text."""
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')
    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)
    # Detect and send native Python encoding to receive correct word offsets.
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16
    result = client.analyze_entity_sentiment(document, encoding_type=encoding)
    return result
for i, id_val in enumerate(article_ids):
    loop_start = time.time()
    if i % 100 == 0:
        print(i)
    # Create a per-article output folder, e.g. <folder>/0, <folder>/1, ...
    dynamic_folder_name = os.path.join(folder, str(i))
    if not os.path.exists(dynamic_folder_name):
        os.makedirs(dynamic_folder_name)
    # The article text is in the second column of the DataFrame row.
    text = list(DF.loc[id_val])[1]
    try:
        text = unicode(text, errors='ignore')
        result = entity_sentiment_text(text)
        # Serialized protobuf output must be written in binary mode.
        out_path = os.path.join(dynamic_folder_name, str(id_val) + ".bin")
        with open(out_path, 'wb') as result_file:
            result_file.write(result.SerializeToString())
    except Exception as e:
        print(e)
        with open("../article_id_error.log", "a") as error_file:
            error_file.write(str(id_val) + "\n")
        log_exception(e, id_val)
Note that this is a one-time analysis for research; I am not building an app. I also know that I cannot reduce the number of calls to the API. In short: if I have to make 18 million calls, what is the quickest way to make them all, rather than going through each article and calling the function one at a time?
I feel like I should be doing some kind of parallel processing, but I am wary of spending more time learning about Dataproc without knowing whether it will help with my problem.
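For example, would something like the following be a sensible direction? This is only a rough sketch of what I have in mind: fanning the existing per-article calls out over a thread pool on one machine. process_article is a hypothetical helper that just wraps the body of my loop above (DF, folder, article_ids, entity_sentiment_text and log_exception are the same names as in my code), the worker count is arbitrary, and I am assuming the client can be shared across threads and that I stay within the API quota. On Python 2 this would also need the futures backport.

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_article(id_val):
    """Call the API for one article and write the serialized result to disk."""
    text = unicode(list(DF.loc[id_val])[1], errors='ignore')
    result = entity_sentiment_text(text)
    out_path = os.path.join(folder, str(id_val) + ".bin")
    with open(out_path, 'wb') as result_file:
        result_file.write(result.SerializeToString())
    return id_val

with ThreadPoolExecutor(max_workers=10) as executor:
    # Submit one task per article and log any failures as they complete.
    futures = {executor.submit(process_article, id_val): id_val
               for id_val in article_ids}
    for future in as_completed(futures):
        try:
            future.result()
        except Exception as e:
            log_exception(e, futures[future])

Or is Dataproc (or some other GCP batch service) actually the right tool for this?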