
I am reading a large amount of data from an API provider. Once I get the response, I need to scan through it, repackage the data, and put it into the App Engine datastore. A single large account can contain ~50k entries.

Every time I get some entries from the API, I store 500 of them as a batch in a temp table and send a processing task to a queue. To avoid too many tasks getting jammed in a single queue, I spread them across 6 queues in total:

import json

from google.appengine.api import taskqueue

count = 0
worker_number = 0          # queues are named worker0 .. worker5
data = {}

for folder, property in entries:
    data[count] = {
        # repackaging data here
    }

    count = (count + 1) % 500

    if count == 0:
        # Stash the 500-entry batch in a temp entity and hand its key to a task
        cache = ClientCache(parent=user_key, data=json.dumps(data))
        cache.put()
        params = {
            'access_token': access_token,
            'client_key': client.key.urlsafe(),
            'user_key': user_key.urlsafe(),
            'cache_key': cache.key.urlsafe(),
        }
        taskqueue.add(
            url=task_url,
            params=params,
            target='dbworker',
            queue_name='worker%d' % worker_number)
        # round-robin across the six worker queues
        worker_number = (worker_number + 1) % 6

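For reference, the ClientCache temp model isn't shown above; a minimal sketch of how such a model might be defined (the properties here are assumptions, it only needs to hold the JSON blob):

from google.appengine.ext import ndb

class ClientCache(ndb.Model):
    # JSON-serialized batch of up to 500 repackaged entries
    data = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
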
And task_url points to the following handler code:

logging.info('--------------------- Process File ---------------------')
user_key = ndb.Key(urlsafe=self.request.get('user_key'))
client_key = ndb.Key(urlsafe=self.request.get('client_key'))
cache_key = ndb.Key(urlsafe=self.request.get('cache_key'))

# Load the batch that was stashed in the temp entity
cache = cache_key.get()
data = json.loads(cache.data)
for property in data.values():
    logging.info(property)
    try:
        key_name = '%s%s' % (property['key1'], property['key2'])
        metadata = Metadata.get_or_insert(
            key_name,
            parent=user_key,
            client_key=client_key,
            # ... other info
        )
        metadata.put()
    except StandardError, e:
        logging.error(e.message)

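The snippet above is presumably the body of a webapp2 request handler registered at task_url; a rough sketch of the surrounding boilerplate, with the handler class and route path assumed:

import webapp2

class ProcessFileWorker(webapp2.RequestHandler):
    def post(self):
        # ... the processing code shown above goes here ...
        pass

app = webapp2.WSGIApplication([
    ('/tasks/process_file', ProcessFileWorker),   # this path would be task_url
])
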
All the tasks run on a backend.

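Since taskqueue.add uses target='dbworker', there is presumably a backend configured for it; a sketch of what that backends.yaml entry might look like (the class and instance values are assumptions):

backends:
- name: dbworker
  class: B4
  instances: 2
  options: dynamic
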
With this structure it works fine... well, most of the time. But sometimes I get this error:

2013-09-19 15:10:07.788
suspended generator transaction(context.py:938) raised TransactionFailedError(The transaction could not be committed. Please try again.)
W 2013-09-19 15:10:07.788
suspended generator internal_tasklet(model.py:3321) raised TransactionFailedError(The transaction could not be committed. Please try again.)
E 2013-09-19 15:10:07.789
The transaction could not be committed. Please try again.

It seems to be a problem of writing to the datastore too frequently? I want to find out how I can pace the writes so the workers run smoothly... Also, is there any other way I can improve performance further? My queue configuration is something like this:

- name: worker0
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
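The full queue.yaml presumably repeats this block for all six queues, something like the following (the same parameters are assumed for each):

queue:
- name: worker0
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
- name: worker1
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
# ... worker2 through worker5 defined the same way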

1 Answer


You are writing single entities one at a time.

How about modifying your code to use ndb.put_multi? Writing in batches will reduce the round trips per transaction.

Also, why use get_or_insert when you overwrite the record every time anyway? You could just write it. Both of these changes will greatly reduce the workload.
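A minimal sketch of what that change to the worker loop might look like, reusing the names from the question; constructing Metadata directly instead of get_or_insert and writing the whole batch at once are the suggested changes, the rest is assumed from the question's code:

# Build all entities for the batch first, then write them in one call
to_put = []
for property in data.values():
    key_name = '%s%s' % (property['key1'], property['key2'])
    to_put.append(Metadata(
        id=key_name,
        parent=user_key,
        client_key=client_key,
        # ... other info
    ))

# One batched RPC instead of one get_or_insert + put per entity
ndb.put_multi(to_put)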

answered 2013-09-19T08:13:44.633