This might not be a good application of NLP, as the text isn't any sort of natural language. Rather, it is structured data that can be extracted using rules. Writing rules is definitely one way to go about this.
You can first try a fuzzy match of the category headings, namely "CARDIAC RISK" and "CHEMISTRIES", against the OCR results to partition the text into its respective categories.
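If you want to try the fuzzy-matching idea without installing anything, here is a minimal sketch using difflib from the standard library instead of fuzzyset. The helper name find_heading, the 0.6 cutoff, and the sample lines are assumptions made up for illustration; they are not taken from your data.

```python
import difflib


def find_heading(lines, heading, cutoff=0.6):
    """Return the index of the line that best matches `heading`, or None."""
    # Compare case-insensitively so "Chemistries" and "CHEMISTRIES" score the same.
    lowered = [line.strip().lower() for line in lines]
    matches = difflib.get_close_matches(heading.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return None
    return lowered.index(matches[0])


# Made-up OCR lines with typical misreads ("l" instead of "I").
ocr_lines = ["CARDIAC RlSK", "CHOLESTEROL", "161", "120 240 mg/dL", "CHEMlSTRIES", "ALBUMIN"]
print(find_heading(ocr_lines, "CARDIAC RISK"))  # index of the first heading
print(find_heading(ocr_lines, "CHEMISTRIES"))   # index of the second heading
```

The two indices give you the boundaries of each category. You may need to tune the cutoff depending on how noisy the OCR is.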
If you are sure that each entry takes exactly 3 lines, you can simply split each category's text on newlines and read the entries off in groups of three.
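As a rough sketch of that grouping step, assuming the three-lines-per-entry layout really holds (the helper name chunk_entries and the sample lines are hypothetical):

```python
def chunk_entries(lines, size=3):
    """Split a flat list of lines into consecutive groups of `size` lines."""
    # Fail loudly if the layout assumption is violated, e.g. OCR dropped a line.
    if len(lines) % size != 0:
        raise ValueError(f"Expected a multiple of {size} lines, got {len(lines)}")
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# Made-up lines for one category: name, value, then range and unit.
section = ["CHOLESTEROL", "161", "120 240 mg/dL", "TRIGLYCERIDES", "75", "10 200 mg/dL"]
for name, value, range_and_unit in chunk_entries(section):
    print(name, value, range_and_unit)
```

Raising an error on a leftover line is a deliberate choice here: a silently shifted entry would pair names with the wrong values.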
Once you have them split into entries, you can extract the name, value, range, and units from each one.
Here's some sample code I ran on the data you provided. It requires the fuzzyset package, which you can get by running python3 -m pip install fuzzyset. Since some entries don't have units, I modified your desired output format slightly and made units a list so it can easily be empty. Any stray letters found on the third line are stored in that list as well.
from fuzzyset import FuzzySet

### Load data
with open("ocr_result.txt") as f:
    data = f.read()
lines = data.split("\n")

### Create fuzzy set
CATEGORIES = ("CARDIAC RISK", "chemistries")
fs = FuzzySet(lines)

### Get the line ranges of each category
cat_ranges = [0] * (len(CATEGORIES) + 1)
for i, cat in enumerate(CATEGORIES):
    # fs.get() returns (score, matched_line) pairs, best match first
    match = fs.get(cat)[0]
    match_idx = lines.index(match[1])
    cat_ranges[i] = match_idx
# The "sample lab report" line marks the end of the last category
last_idx = lines.index(fs.get("sample lab report")[0][1])
cat_ranges[-1] = last_idx

### Read lines in each category
def _to_float(s: str) -> float:
    """
    Attempt to convert a string value to float
    """
    try:
        f = float(s)
    except ValueError:
        # OCR sometimes reads a decimal point as a comma
        if "," in s:
            s = s.replace(",", ".")
            f = float(s)
        else:
            raise ValueError(f"Cannot convert {s} to float.")
    return f

result = {}
for i, cat in enumerate(CATEGORIES):
    result[cat] = {}
    # Ignore the line of the category itself
    s = slice(cat_ranges[i] + 1, cat_ranges[i + 1])
    lines_in_cat = lines[s]
    if len(lines_in_cat) % 3 != 0:
        raise ValueError(f"Expected 3 lines per entry in {cat!r}, got {len(lines_in_cat)} lines")
    for j in range(0, len(lines_in_cat), 3):
        _name = lines_in_cat[j]
        _value = lines_in_cat[j + 1]
        # split() without arguments also drops empty tokens from repeated spaces
        _line_3 = lines_in_cat[j + 2].split()
        # Convert value to float
        _value = _to_float(_value)
        # Process line 3 to get range and unit
        _range = []
        _unit = []
        for v in _line_3:
            # The first two tokens starting with a digit are the range bounds
            if v[0].isdigit() and len(_range) < 2:
                _range.append(_to_float(v))
            else:
                _unit.append(v)
        _l = [_value, _unit, _range]
        result[cat][_name] = _l
print(result)
Output:
{'CARDIAC RISK': {'CHOLESTEROL': [161.0, ['mg/dL'], [120.0, 240.0]], 'CHOLESTEROLHDL RATIO': [2.39, [], [1.25, 5.0]], 'HIGH DENSITY LIPOPROTEINCHDL)': [67.3, ['me/dL'], [35.0, 75.0]], 'LOW DENSITY LIPOPROTEIN (LDL)': [78.7, ['a', 'midI.'], [60.0, 190.0]], 'TRIGLYCERIDES': [75.0, ['a', 'made'], [10.0, 200.0]]}, 'chemistries': {'ALBUMIN': [4.4, ['pidl'], [3.5, 5.5]], 'ALKALINE PHOSPHATASE': [49.0, ['UAL'], [30.0, 120.0]], 'BLOOD UREA NITROGEN (BUN)': [17.0, ['meidL'], [6.0, 2500.0]], 'CREATININE': [0.85, ['matdL'], [60.0, 1.5]], 'FRUCTOSAMINE': [182.0, ['mmoV/l'], [1.2, 1.79]], 'GAMMA GLUTAMYUTRANSFERASE': [9.0, ['UIL'], [2.0, 65.0]], 'GLOBULIN': [2.8, ['gidL.'], [1.0, 4.0]], 'GLUCOSE': [61.0, ['me/dl.'], [70.0, 125.0]], 'HEMOGLOBIN AIC': [5.1, ['%'], [3.0, 6.0]], 'SGOT (AST)': [25.0, ['UM'], [0.0, 41.0]], 'SOPI (ALT)': [22.0, ['IMI'], [0.0, 45.0]], 'TOTAL BILIRUBIN': [0.52, ['mmeldi.'], [0.1, 1.2]], 'TOTAL PROTEIN': [720.0, ['gidl.'], [6.0, 8.5]]}}