On OS X, I've converted a PowerPoint deck to ASCII text, and now want to process it with awk.
- I want to split the file into multiline records corresponding to slides in the deck.
- Treating any line beginning with a capital Latin letter as the start of a new record provides a good approximation, but I can't figure out how to do this in awk.
- I've tried resetting the record separator, e.g.
RS = "\n^[A-Z]"
and
RS = "\n^[[:alnum:]][[:upper:]]"
and various permutations, but none of them split the file as intended. That is, awk keeps treating each individual line as a record, rather than grouping the lines as I want.
The cleaned text looks like this:
Welcome
++ Class will focus on:
– Basics of SQL syntax
– SQL concepts analogous to Excel concepts
Who Am I
++ Self-taught on LAMP(ython) stack
++ Plus some DNS, bash scripting, XML / XSLT
++ Prior professional experience:
– Office of Management and Budget
– Investment banking (JP Morgan, UBS, boutique)
– MBA, University of Chicago
Roadmap
+ Preliminaries
+ What is SQL
+ Excel vs SQL
+ Moving data from Excel to SQL and back
+ Query syntax basics
- Running queries
- Filtering, grouping
- Functions
- Combining tables
+ Using queries for analysis
Some 'slides' have blank lines, some don't.
Once past these hurdles, I plan to wrap each record in an HTML tag for use in deck.js. But getting the record definitions right is killing me.
How do I do those things?
EDIT: The question originally also asked about converting Unicode bullet characters to ASCII, but I've figured that out. Some remarks in the comments focus on that part.
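
One possible way to get this grouping, offered as a sketch under assumptions rather than a definitive answer, is to leave RS alone and accumulate lines into a record by hand. POSIX leaves the behavior of a multi-character RS unspecified, so the stock awk on OS X may not treat it as a regex at all. Even where a regex RS is supported, the text it matches (including the capital letter) is removed from the records. The file name `slides.txt`, the function name `split_slides`, and the `[slide N]` marker below are hypothetical, and the sample input is an abridged ASCII-only excerpt of the text above.

```shell
# Hypothetical sample file: an abridged, ASCII-only excerpt of the deck text.
cat > slides.txt <<'EOF'
Welcome
++ Class will focus on:

Who Am I
++ Self-taught on LAMP(ython) stack
Roadmap
+ Preliminaries
- Running queries
EOF

# split_slides (hypothetical helper): group lines into one record per slide.
# LC_ALL=C keeps [A-Z] limited to ASCII capitals regardless of locale.
split_slides() {
    LC_ALL=C awk '
    # A line starting with a capital letter opens a new record.
    /^[A-Z]/ {
        if (rec != "") emit()      # flush the previous slide first
        rec = $0
        next
    }
    # Any other non-blank line is appended to the current record.
    NF { rec = (rec == "" ? $0 : rec "\n" $0) }
    # Flush the last slide at end of input.
    END { if (rec != "") emit() }

    # Print one record; the [slide N] marker is a placeholder for a wrapper.
    function emit() {
        n++
        printf "[slide %d]\n%s\n", n, rec
    }
    ' "$@"
}

split_slides slides.txt
```

On the sample input this prints three records, each introduced by its `[slide N]` marker, with the blank line inside the first slide dropped. Replacing the `printf` in `emit()` with whatever opening and closing markup deck.js needs would be the place to add the wrapper.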