I have an Excel spreadsheet of product data from a website, with the following column headings:
ProductID, ProductDescription
The ProductDescription field contains HTML detailing the full description of a website product, and within each description the weight is displayed as part of a string (e.g. 'Weight is 950g' or 'Weight is 1.5kg') with no space between the number and the unit of weight.
What I wish to do is:
Import the Excel spreadsheet into a pandas DataFrame
Create a new column named 'Weight'
Parse each 'ProductDescription' (approximately 5000 rows of products) and, using regex, find the text where the weight is mentioned (it can be identified as 'XXXXg' or 'XXXXkg') and place this in the 'Weight' column of the DataFrame as a numerical value (float).
Finally, export this new three-column DataFrame as an Excel file.
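As a rough sketch of the four steps above, the whole column can be processed at once with `Series.str.extract`. Several things here are assumptions rather than part of the original data: the pattern (a number directly followed by `g` or `kg`), the conversion of kilograms to grams so the column holds a single unit, and the hypothetical helper names `add_weight_column` and `convert_file`.

```python
import re

import pandas as pd

# Assumed pattern: digits with an optional decimal part, immediately
# followed by "kg" or "g" (no space), e.g. "950g" or "1.5kg".
WEIGHT_PATTERN = r"(\d+(?:\.\d+)?)(kg|g)\b"


def add_weight_column(df):
    """Return a copy of df with a numeric 'Weight' column, in grams."""
    # One row per description; column 0 is the number, column 1 the unit.
    # Only the first match in each description is used.
    parts = df["ProductDescription"].str.extract(WEIGHT_PATTERN, flags=re.IGNORECASE)
    value = pd.to_numeric(parts[0])
    # Assumption: normalise everything to grams (kg -> 1000 g).
    factor = parts[1].str.lower().map({"kg": 1000.0, "g": 1.0})
    out = df.copy()
    out["Weight"] = value * factor  # rows with no match stay NaN
    return out


def convert_file(in_path, out_path):
    """Read the spreadsheet, add 'Weight', and write a new Excel file."""
    df = pd.read_excel(in_path)
    add_weight_column(df).to_excel(out_path, index=False)


# Example call (file names are placeholders, not run here):
# convert_file("Products.xlsx", "ProductsWithWeight.xlsx")
```

Descriptions without a recognisable weight end up as NaN in the new column, which makes them easy to find afterwards with `df[df["Weight"].isna()]`.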
I've cobbled together the small script below, but it keeps throwing errors. If anyone could help, I would be most grateful.
import pandas as pd
import re

def weight(inputString):
    result = [re.search('([0-9.]+[kgG]{1,2})', s) for s in inputString]
    return result

excel_file = 'Products.xlsx'
df = pd.read_excel(excel_file)
df['Weight'] = df['ProductDescription'].apply(weight)
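For comparison, here is a hedged sketch of how the `weight` function might be repaired while keeping the `apply` approach. `apply` passes one description string at a time, so `for s in inputString` loops over single characters, and `re.search` returns a match object (or `None`) rather than a number. The character class `[kgG]{1,2}` would also accept strings such as `gg`. The version below searches the whole string once and returns a float. Returning the value in grams is an assumption made so that the column uses one unit, and the output file name is a placeholder.

```python
import re

# Number (optional decimal part) directly followed by "kg" or "g".
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)(kg|g)\b", re.IGNORECASE)


def weight(description):
    """Return the first weight found in one description, in grams, or None."""
    if not isinstance(description, str):  # e.g. an empty cell read as NaN
        return None
    match = WEIGHT_RE.search(description)
    if match is None:
        return None
    value = float(match.group(1))
    # Assumption: convert kilograms to grams so all values share a unit.
    return value * 1000 if match.group(2).lower() == "kg" else value


# Usage with the DataFrame from the script above (not run here):
# df['Weight'] = df['ProductDescription'].apply(weight)
# df.to_excel('ProductsWithWeight.xlsx', index=False)  # placeholder name
```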
I hope you can help. Please excuse my inelegantly stuck-together code! I'm still VERY new to this.