
On OSX, I've converted a PowerPoint deck to ASCII text, and now want to process it with awk.

  • I want to split the file into multiline records corresponding to slides in the deck.
  • Treating any line beginning with a capital Latin letter as the start of a new slide gives a good approximation, but I can't figure out how to do this in awk.
  • I've tried resetting the record separator, RS = "\n^[A-Z]" and RS = "\n^[[:alnum:]][[:upper:]]", and various permutations, but none of them makes a difference. That is, awk keeps treating each individual line as a record, rather than grouping the lines as I want.

The cleaned text looks like this:

Welcome
++  Class will focus on:
–   Basics of SQL syntax
–   SQL concepts analogous to Excel concepts
Who Am I
++  Self-taught on LAMP(ython) stack
++  Plus some DNS, bash scripting, XML / XSLT
++  Prior professional experience:
–   Office of Management and Budget
–   Investment banking (JP Morgan, UBS, boutique)
–   MBA, University of Chicago


Roadmap
+   Preliminaries
+   What is SQL
+   Excel vs SQL
+   Moving data from Excel to SQL and back
+   Query syntax basics
-   Running queries
-   Filtering, grouping
-   Functions
-   Combining tables
+   Using queries for analysis

Some 'slides' have blank lines, some don't.

Once past these hurdles I plan to wrap each record in an tag for use in deck.js. But getting the record definitions right is killing me.

How do I do those things?

EDIT: The question initially also asked about converting Unicode bullet characters to ASCII, but I've figured that out. Some remarks in the comments focus on that.

2 Answers

You could try collecting the records in awk like this:

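/^[[:upper:]]/ {
    # A line starting with a capital letter begins a new record:
    # print the record collected so far, then start over with this line.
    if (r>0) print rec
    r=1; rec=$0 RS; next
}
{
    # Any other line is appended to the current record.
    rec=rec $0 RS
}

END {
    # Print the last record.
    print rec
}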
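
To see the grouping at work, here is a minimal sketch that runs the same idea on a few lines in the style of the sample from the question. It is only an illustration: the `split_slides` function name and the `--- slide N ---` header are hypothetical markers, not part of the answer above or of deck.js, and lines before the first capitalised line are simply skipped.

```shell
#!/bin/sh
# Sketch: collect lines into per-slide records and label each record.
# split_slides and the "--- slide N ---" header are made up for illustration.
split_slides() {
    awk '
    /^[[:upper:]]/ {                  # a capitalised line starts a new slide
        if (n > 0) emit()             # flush the slide collected so far
        n++; rec = $0; next
    }
    n > 0 { rec = rec "\n" $0 }       # other lines join the current slide
    END   { if (n > 0) emit() }       # flush the last slide
    function emit() {
        printf "--- slide %d ---\n%s\n", n, rec
    }
    ' "$@"
}

# Usage: reads the named files, or standard input when no file is given.
printf '%s\n' 'Welcome' '++  Class will focus on:' 'Who Am I' '++  Self-taught' |
    split_slides
```

Because the records are assembled by hand, this does not depend on how a particular awk interprets a multi-character RS, which varies between implementations. The place where the header is printed is also where a wrapping tag could be written instead.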

To get rid of the bullet characters, you can use:

gsub (/•/,"++",rec)
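
As a rough sketch of how that call behaves, the snippet below applies it to each input line. It assumes the text is UTF-8 and that the awk in use passes multibyte characters through unchanged; the `ascii_bullets` name is hypothetical. In the record-collecting program, the same gsub could be called on rec just before each print.

```shell
#!/bin/sh
# Sketch: replace every Unicode bullet with "++" before printing.
# ascii_bullets is a made-up name; UTF-8 input is assumed.
ascii_bullets() {
    awk '{
        rec = $0
        gsub(/•/, "++", rec)   # gsub edits rec in place
        print rec
    }' "$@"
}

# Usage: prints "++  Class will focus on:"
printf '%s\n' '•  Class will focus on:' | ascii_bullets
```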

answered 2013-10-20T15:33:18.597

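You could try using the "textutil" utility built into OSX to convert the files from a script, which would save you doing it all by hand. Try typing the following in a Terminal window and paging through the output: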

man textutil

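Once you have some converted text, try posting it so people can see what the input looks like, and then maybe someone can help you split it up the way you want.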

answered 2013-10-20T16:16:22.240