
I am relatively new to Google Cloud Platform. I have a large dataset (18 million articles) and need to do entity-sentiment analysis using GCP's NLP-API. I am not sure whether the way I have been conducting my analysis is optimal in terms of the time it takes to get the entity sentiment for all the articles, and I wonder if there is a way to batch-process these articles instead of iterating through each one and making an API call. Here is a summary of the process I have been using.

  1. I have about 500 files each of which contains about 30,000 articles.
  2. Using a Python script on my local server, I iterate through each file and, for each article, call the function given here (reproduced in the snippet below).
  3. I store the entire output for each article in a protobuf.

After this step, I don't require the Google API and perform my final analysis on the API output stored in the protobufs.
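For reference, this is roughly how I read the serialized output back for the final analysis (a minimal sketch, assuming the same pre-2.0 google-cloud-language client used in the code below; the file path is hypothetical):

from google.cloud.language import types

def load_result(path):
    # Parse one stored AnalyzeEntitySentimentResponse back from disk
    with open(path, 'rb') as f:
        return types.AnalyzeEntitySentimentResponse.FromString(f.read())

result = load_result("0/12345.bin")   # hypothetical path
for entity in result.entities:
    print(entity.name, entity.sentiment.score, entity.sentiment.magnitude)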

This approach worked well enough for a research project with about 1.5 million articles, where it took a few days. Now that I have 18 million articles, I wonder if there is a better way to go about this. The articles I have read about batch processing are geared towards building an app or towards image-processing tasks. There was something like what I wanted here, but I am not sure whether I can do this with the NLP-API.

This is a snippet of my code; DF is a pandas DataFrame containing my articles.

import os
import sys
import time

import six
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

client = language.LanguageServiceClient()


def entity_sentiment_text(text):
    """Detects entity sentiment in the provided text."""
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')
    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)
    # Detect and send native Python encoding to receive correct word offsets.
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16
    result = client.analyze_entity_sentiment(document, encoding)
    return result


for i, id_val in enumerate(article_ids):
    loop_start = time.time()
    if i % 100 == 0:
        print(i)
        # Start a new output folder every 100 articles, e.g. "<folder>/100"
        dynamic_folder_name = os.path.join(folder, str(i))
        # Create the directory if it does not exist yet
        if not os.path.exists(dynamic_folder_name):
            os.makedirs(dynamic_folder_name)
    file_name = str(id_val) + ".txt"
    text = list(DF.loc[id_val])[1]
    try:
        if isinstance(text, bytes):
            text = text.decode('utf-8', errors='ignore')
        result = entity_sentiment_text(text)
        # Serialized protobuf output is binary, so write in 'wb' mode
        with open(os.path.join(dynamic_folder_name, str(id_val) + ".bin"), 'wb') as result_file:
            result_file.write(result.SerializeToString())
    except Exception as e:
        print(e)
        with open("../article_id_error.log", "a") as error_file:
            error_file.write(str(id_val) + "\n")
        log_exception(e, id_val)

Note that this is a one-time analysis for research and I am not building an app. I also know that I cannot reduce the number of calls to the API. In summary, if I am making 18 Million calls, what is the quickest way to make all these calls instead of going through each article and calling the function individually?

I feel like I should be doing some kind of parallel processing, but I am a bit wary about spending more time learning about Dataproc without knowing if that will help me with my problem.
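
To illustrate the kind of parallelism I have in mind (just a sketch using a thread pool; MAX_WORKERS is a hypothetical value and I have not checked how this interacts with the API quotas):

import concurrent.futures
import os

MAX_WORKERS = 10  # hypothetical value; would need tuning against the API quota

def process_article(id_val):
    # Same per-article work as in the loop above, made callable from a worker thread
    text = list(DF.loc[id_val])[1]
    result = entity_sentiment_text(text)
    out_path = os.path.join(folder, str(id_val) + ".bin")
    with open(out_path, 'wb') as f:
        f.write(result.SerializeToString())

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = {executor.submit(process_article, id_val): id_val for id_val in article_ids}
    for future in concurrent.futures.as_completed(futures):
        try:
            future.result()
        except Exception as e:
            log_exception(e, futures[future])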


1 Answer


You will need to manage merging documents to obtain a smaller total job count. You will also need to rate-limit your requests against both the requests-per-minute and the total requests-per-day quotas.
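
As a rough illustration of client-side rate limiting (just a sketch; the 600 requests-per-minute figure is a placeholder, check the actual quota for your project in the Cloud Console):

import time

REQUESTS_PER_MINUTE = 600          # placeholder; use your project's actual quota
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE
_last_call = [0.0]

def rate_limited(fn, *args, **kwargs):
    # Sleep just long enough that calls to fn never exceed REQUESTS_PER_MINUTE
    wait = MIN_INTERVAL - (time.time() - _last_call[0])
    if wait > 0:
        time.sleep(wait)
    _last_call[0] = time.time()
    return fn(*args, **kwargs)

# e.g. result = rate_limited(entity_sentiment_text, text)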

Pricing is based on characters, in units of 1,000 characters. If you are planning to process 18 million articles (how many words per article?), I would contact Google Sales to discuss the project and arrange for credit approval. You will hit the quota limits very quickly, and then your jobs will return API errors.
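
As a back-of-the-envelope estimate (the average article length here is an assumption, not something from the question):

import math

AVG_CHARS_PER_ARTICLE = 6000   # assumption -- measure this on the real data
ARTICLES = 18000000

units_per_article = math.ceil(AVG_CHARS_PER_ARTICLE / 1000.0)   # billed per 1,000 characters
total_units = ARTICLES * units_per_article
print(total_units)   # 108,000,000 billable units under these assumptions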

I would start by reading this section of the documentation:

https://cloud.google.com/natural-language/docs/resources

answered 2019-10-02T20:57:13.507