
This is a pretty complicated question, so be prepared! I want to generate some test data in Excel for my EAV table. The columns I have are:

user_id, attribute, value

Each user_id should repeat a random number of times between 1 and 4. For each entry I want to pick a random attribute from a list, and then a random value that this attribute can take on. Lastly, I want the attributes for each id to be unique, i.e. I do not want more than one entry with the same id and attribute. Below is an example of what I mean:

user_id attribute   value
100001  gender      male
100001  religion    jewish
100001  university  imperial
100002  gender      female
100002  course      physics

Possible values:

attribute   value
gender      male
            female
course      maths
            physics
            chemistry
university  imperial
            cambridge
            oxford
            ucl
religion    jewish
            hindu
            christian
            muslim

Sorry that the table above messed up. I don't know how to paste into here while retaining the structure! Hopefully you can see what I'm talking about otherwise I can get a screenshot.

How can I do this? In the past I have generated random data using a random number generator and a VLOOKUP but this is a bit out of my league.


2 Answers


My approach is to create a table with all four attributes for each ID and then filter that table randomly to get between one and four filtered rows per ID. I assigned a random value to each attribute. The basic setup looks like this:

[Image: randomized EAV table with lookup table]

To the left is the randomized EAV table and to the right (columns G:H) is the lookup table used for the randomized values. Here are the formulas. Enter them in row 2 and copy down:

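Column A - Establishes one random number between 1 and 4 for every four rows, i.e. once per ID. This determines which of the ID's four attribute rows is always included, so every ID ends up with at least one row: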

=IF(COUNTIF(C$2:C2,C2)=1,RANDBETWEEN(1,4),A1)

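Column B - Uses the value in column A to determine if the row is included. The row whose position within its ID matches column A is always TRUE; every other row is TRUE or FALSE at random: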

=IF(COUNTIF(C$2:C2,C2)=A2,TRUE,RANDBETWEEN(0,1)=1)

Column C - Creates the IDs, starting with 100,001:

=(INT((ROW()-2)/4)+100000)+1

Column D - Repeats the four attributes:

=CHOOSE(MOD(ROW()-2,4)+1,"gender","course","university","religion")

Column E - Finds the first occurrence of the Column D attribute in the lookup table and selects a randomly offset value:

=INDEX($H$2:$H$14,(MATCH(D2,$G$2:$G$14,0))+RANDBETWEEN(0,COUNTIF($G$2:$G$14,D2)-1))
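For anyone who finds the Column E formula hard to read, here is a rough Python sketch of the same lookup. It assumes the lookup table repeats the attribute name on every row (which is what the COUNTIF over G2:G14 appears to rely on). The names `LOOKUP` and `random_value` are made up for this illustration.

```python
import random

# Stand-in for the lookup table in G2:H14 (assumed layout: the attribute
# name is repeated on every row, with its values grouped together).
LOOKUP = [
    ("gender", "male"), ("gender", "female"),
    ("course", "maths"), ("course", "physics"), ("course", "chemistry"),
    ("university", "imperial"), ("university", "cambridge"),
    ("university", "oxford"), ("university", "ucl"),
    ("religion", "jewish"), ("religion", "hindu"),
    ("religion", "christian"), ("religion", "muslim"),
]


def random_value(attribute, lookup=LOOKUP, rng=random):
    """Pick a random value for `attribute`, mirroring the Column E formula."""
    names = [name for name, _ in lookup]
    first = names.index(attribute)      # MATCH(D2, G2:G14, 0), zero-based here
    count = names.count(attribute)      # COUNTIF(G2:G14, D2)
    offset = rng.randint(0, count - 1)  # RANDBETWEEN(0, count - 1)
    return lookup[first + offset][1]    # INDEX(H2:H14, first + offset)


print(random_value("university"))
```

The first-match-plus-offset trick only works while all rows for one attribute sit next to each other in the lookup table.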

When you filter on the TRUEs in Column B you'll get your list of one to four Attributes per ID. Disconcertingly, the filtering forces a recalculation, so the filtered list will no longer say TRUE for every cell in column B. If that is a problem, copy the range and paste it back as values before filtering.
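To see why filtering on column B always leaves between one and four rows per ID, here is a small Python sketch of the logic in columns A to D. This is only a simulation of the worksheet under my reading of the formulas, and `ATTRIBUTES` and `build_rows` are hypothetical names, not anything Excel provides.

```python
import random

ATTRIBUTES = ["gender", "course", "university", "religion"]


def build_rows(first_id, id_count, rng=random):
    """Return the (user_id, attribute) pairs that would survive the filter."""
    rows = []
    n = len(ATTRIBUTES)  # the "magic number" 4 from the formulas
    for i in range(id_count):
        user_id = first_id + i            # column C
        guaranteed = rng.randint(1, n)    # column A: drawn once per ID
        for position, attribute in enumerate(ATTRIBUTES, start=1):  # column D
            # column B: the guaranteed row is always kept, others are a coin flip
            keep = position == guaranteed or rng.randint(0, 1) == 1
            if keep:
                rows.append((user_id, attribute))
    return rows


for row in build_rows(100001, 3):
    print(row)
```

One consequence of this design: the row count per ID is one guaranteed row plus three coin flips, so it is not uniform. Two or three rows each come up with probability 3/8, while one or four rows each come up with probability 1/8.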

If this were mine I'd automate it a little more, perhaps by putting the "magic number" 4 (the count of attributes) in its own cell.

Answered 2012-08-18T22:15:01.430

There are a number of ways to do this. You could use either Perl or Python. Both have modules for working with spreadsheets. In this case, I used Python and the openpyxl module.

# File:  datagen.py
# Usage: datagen.py <excel (.xlsx) filename to store data>
# Example:  datagen.py myfile.xlsx
# Requires: Python 3 and the openpyxl package

import sys
import random
from openpyxl import Workbook

# verify that user specified an argument
if len(sys.argv) < 2:
    print("Specify an Excel filename to save the data, e.g. myfile.xlsx")
    sys.exit(1)

# get the excel workbook and worksheet objects
wb = Workbook()
ws = wb.active

# Modify this line to specify the range of user ids
ids = range(100001, 100100)

# data structure for the attributes and values
data = { 'gender':      ['male',    'female'],
         'course':      ['maths',   'physics',  'chemistry'],
         'university':  ['imperial','cambridge','oxford',   'ucl'],
         'religion':    ['jewish',  'hindu',    'christian','muslim']}

# Write column headers in the spreadsheet
ws.cell(row=1, column=1, value='user_id')
ws.cell(row=1, column=2, value='attribute')
ws.cell(row=1, column=3, value='value')

row = 1

# Loop through each user id
for user_id in ids:
    # randomly select how many attributes to use
    attr_cnt = random.randint(1, 4)
    # copy the keys into a list so that entries can be removed
    attributes = list(data.keys())
    for idx in range(attr_cnt):
        # randomly select attribute
        attr = random.choice(attributes)
        # remove the selected attribute from further selection for this user id
        attributes.remove(attr)
        # randomly select a value for the attribute
        value = random.choice(data[attr])
        row = row + 1
        # write the values for the current row in the spreadsheet
        ws.cell(row=row, column=1, value=user_id)
        ws.cell(row=row, column=2, value=attr)
        ws.cell(row=row, column=3, value=value)

# save the spreadsheet using the filename specified on the cmd line
wb.save(filename=sys.argv[1])
print("Done!")
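
If you want to check the rows before writing anything to a file, a possible variation is to build them in memory first. This is just a sketch, not part of the script above: `DATA` and `generate_rows` are names I made up, and it swaps the choose-then-remove loop for `random.sample`, which returns distinct items and so gives the same "no repeated attribute per id" guarantee.

```python
import random

DATA = {
    "gender": ["male", "female"],
    "course": ["maths", "physics", "chemistry"],
    "university": ["imperial", "cambridge", "oxford", "ucl"],
    "religion": ["jewish", "hindu", "christian", "muslim"],
}


def generate_rows(user_ids, data=DATA, rng=random):
    """Return (user_id, attribute, value) tuples with unique attributes per id."""
    rows = []
    for user_id in user_ids:
        count = rng.randint(1, len(data))  # how many attributes this id gets
        # sample() never returns the same attribute twice
        for attribute in rng.sample(list(data), count):
            rows.append((user_id, attribute, rng.choice(data[attribute])))
    return rows


for row in generate_rows(range(100001, 100004)):
    print(row)
```

Each tuple could then be written out with openpyxl, for example by passing it to the worksheet's `append` method, or saved as CSV and opened in Excel.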
Answered 2012-08-18T22:17:48.283