I am reading a large amount of data from an API provider. Once I get the response, I need to scan through the data, repackage it, and put it into the App Engine datastore. A particularly big account will contain ~50k entries.
Every time I get some entries from the API, I store 500 of them as a batch in a temp table and send a processing task to a queue. To avoid too many tasks jamming up a single queue, I use 6 queues in total:
import json

from google.appengine.api import taskqueue

count = 0
worker_number = 0          # rotate over the 6 queues worker0 .. worker5
data = {}
for folder, property in entries:
    data[count] = {
        # repackaging data here
    }
    count = (count + 1) % 500
    if count == 0:
        # Store the batch of 500 repackaged entries in a temp entity
        # and hand its key to a worker task.
        cache = ClientCache(parent=user_key, data=json.dumps(data))
        cache.put()
        params = {
            'access_token': access_token,
            'client_key': client.key.urlsafe(),
            'user_key': user_key.urlsafe(),
            'cache_key': cache.key.urlsafe(),
        }
        taskqueue.add(
            url=task_url,
            params=params,
            target='dbworker',
            queue_name='worker%d' % worker_number)
        worker_number = (worker_number + 1) % 6
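ClientCache here is basically just a holder for one serialized batch, parented under the user; a minimal sketch of the model (the real one may use a different property type or carry more fields):

from google.appengine.ext import ndb

class ClientCache(ndb.Model):
    # One serialized batch of up to 500 repackaged entries; the worker
    # task fetches it via the urlsafe key passed in the task params.
    data = ndb.TextProperty()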
And the task_url points to the following handler code:
logging.info('--------------------- Process File ---------------------')
user_key = ndb.Key(urlsafe=self.request.get('user_key'))
client_key = ndb.Key(urlsafe=self.request.get('client_key'))
cache_key = ndb.Key(urlsafe=self.request.get('cache_key'))

cache = cache_key.get()
data = json.loads(cache.data)

for property in data.values():
    logging.info(property)
    try:
        key_name = '%s%s' % (property['key1'], property['key2'])
        metadata = Metadata.get_or_insert(
            key_name,
            parent=user_key,
            client_key=client_key,
            # ... other info
        )
        metadata.put()
    except StandardError, e:
        logging.error(e.message)
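As I understand it, Metadata.get_or_insert does a transactional get-then-create keyed under the parent entity group, conceptually something like this sketch (illustrative only, not the actual SDK code):

@ndb.transactional
def get_or_insert_metadata(key_name, user_key, **kwargs):
    # Transaction on the entity group rooted at user_key: fetch the
    # entity if it already exists, otherwise create and store it.
    key = ndb.Key(Metadata, key_name, parent=user_key)
    metadata = key.get()
    if metadata is None:
        metadata = Metadata(key=key, **kwargs)
        metadata.put()
    return metadata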
All the tasks run on a backend.
With this structure it works fine... well, most of the time. But sometimes I get this error:
2013-09-19 15:10:07.788
suspended generator transaction(context.py:938) raised TransactionFailedError(The transaction could not be committed. Please try again.)
W 2013-09-19 15:10:07.788
suspended generator internal_tasklet(model.py:3321) raised TransactionFailedError(The transaction could not be committed. Please try again.)
E 2013-09-19 15:10:07.789
The transaction could not be committed. Please try again.
Is this a problem of writing to the datastore too frequently? I want to find out how to pace the writes so the workers run smoothly. Also, is there any other way I can improve the performance further? My queue configuration is something like this:
- name: worker0
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
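With worker1 through worker5 defined the same way (assuming the other entries mirror worker0, since only worker0 is shown above), the full queue.yaml would look roughly like:

queue:
- name: worker0
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
- name: worker1
  rate: 120/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 3
# ... worker2 through worker5 repeat the same block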