Extracting weight from a column (containing a string) in a Pandas DataFrame using regex and adding it to a new column

Question

I have an Excel Spreadsheet with product data from a website, with the following column headings:

ProductID, ProductDescription

The ProductDescription field contains HTML detailing the full description of a website product, and within each description the weight is displayed as part of a string (e.g. 'Weight is 950g' or 'Weight is 1.5kg') with no space between the number and the unit of weight.

What I wish to do is:

1. Import the Excel spreadsheet into a Pandas DataFrame.
2. Create a new column named 'Weight'.
3. Parse each 'ProductDescription' (approximately 5000 rows of products) and, using regex, find the text where the weight is mentioned (it can be identified as 'XXXXg' or 'XXXXkg'), then place it in the 'Weight' column of the DataFrame as a numerical value (float).
4. Finally, export this new three-column DataFrame as an Excel file.

I've botched together a small script below, but it is throwing up errors constantly. If anyone could help, I would be most grateful.

import pandas as pd
import re as re


def weight(inputString):

    result = [re.search('([0-9.]+[kgG]{1,2})', s) for s in inputString]

    return result

excel_file = 'Products.xlsx'
df = pd.read_excel(excel_file)

df['Weight'] = df['ProductDescription'].apply(weight)

I hope you can help. Please excuse my inelegantly stuck-together code! I'm still VERY new to this.

score 3 · Accepted Answer

You may use

df["Weight"] = (
    df["ProductDescription"]
    .str.extract(r"(?i)(\d+(?:\.\d+)?)\s*[kmd]?g\b", expand=False)
    .astype(float)
)
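To see what this produces, here is a minimal sketch on a small made-up DataFrame (the three rows are invented for illustration, not taken from the real spreadsheet). Note that only the number is captured, so the unit is not reflected in the result.

```python
import pandas as pd

# Made-up sample rows standing in for the real spreadsheet data.
df = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "ProductDescription": [
        "<p>Weight is 950g</p>",
        "<p>Weight is 1.5kg</p>",
        "<p>No weight listed</p>",
    ],
})

# With a single capture group and expand=False, str.extract returns a Series.
df["Weight"] = (
    df["ProductDescription"]
    .str.extract(r"(?i)(\d+(?:\.\d+)?)\s*[kmd]?g\b", expand=False)
    .astype(float)
)

print(df["Weight"].tolist())  # [950.0, 1.5, nan]
```

Rows without a match end up as NaN rather than raising an error.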

The (?i)(\d+(?:\.\d+)?)\s*[kmd]?g\b pattern matches:

(?i) - makes the pattern case insensitive
(\d+(?:\.\d+)?) - Group 1: 1+ digits, an optional occurrence of . and 1+ digits
\s* - 0+ whitespaces
[kmd]? - an optional k, m or d
g - a g
\b - word boundary.
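Because Group 1 holds only the number, '950g' and '1.5kg' come out as 950.0 and 1.5 on different scales. If the weights need to be comparable, one possible variation is to capture the unit as well and convert everything to grams. This is a sketch under assumptions: the helper name add_weight_in_grams, the TO_GRAMS table and the WeightGrams column name are all hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical conversion table: multiplier from each unit to grams.
TO_GRAMS = {"mg": 0.001, "dg": 0.1, "g": 1.0, "kg": 1000.0}

# Same pattern as above, but with named groups for the number and the unit.
PATTERN = r"(?i)(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[kmd]?g)\b"


def add_weight_in_grams(df, source="ProductDescription", target="WeightGrams"):
    """Return a copy of df with the first weight found in `source`, in grams."""
    # Named groups make str.extract return a DataFrame with 'value' and 'unit'.
    parts = df[source].str.extract(PATTERN)
    # Lower-case the unit so 'KG' and 'kg' map to the same factor; no match stays NaN.
    factor = parts["unit"].str.lower().map(TO_GRAMS)
    out = df.copy()
    out[target] = parts["value"].astype(float) * factor
    return out


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "ProductDescription": [
        "<p>Weight is 950g</p>",
        "<p>Weight is 1.5KG</p>",
        "<p>No weight listed</p>",
    ],
})
result = add_weight_in_grams(sample)
print(result["WeightGrams"].tolist())  # [950.0, 1500.0, nan]
```

The resulting DataFrame can then be written back out with result.to_excel('Products_with_weight.xlsx', index=False), where the output file name is only an example.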
