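Let me start by saying this is not a foolproof script. There are probably things I forgot or overlooked, but it is a proof of concept that you can improve and extend, or just use to get an idea.
There is enough regularity in the layout of the text for us to work with. The script splits the transcript into an array of lines and matches each line against a set of patterns, to recognize those regularities and work out what kind of data the line holds.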
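To make the idea concrete, here is a minimal sketch of the "split into lines, then classify each line" step. The function name `classifyLine` is hypothetical and the patterns are simplified for illustration. They are not the exact ones used in the script, and the sample layout (a line number followed by the text) is an assumption about how the transcript looks.

```php
<?php
// Hypothetical helper for illustration only: classify one transcript line.
// The patterns are simplified; the layout "linenr, then text" is assumed.
function classifyLine( $line ) {
    if( !preg_match( '/[[:graph:]]/', $line ) ) { // no printable characters
        return 'empty';
    }
    if( preg_match( '/^\h*\d+\h+\(\d{1,2}[:.]\d{2}\h*[ap]m\)\s*$/i', $line ) ) { // e.g. "(10.00 am)"
        return 'time';
    }
    if( preg_match( '/^\h*\d+\h+[A-Z]+(?:\h[A-Z]+){0,4}:\h*\S/', $line ) ) { // "NAME IN CAPS: text"
        return 'speaker';
    }
    if( preg_match( '/^\h*\d+\h+\S/', $line ) ) { // line number followed by text
        return 'text';
    }
    return 'unknown';
}

// assumed sample layout, based on the first lines of the hearing
$sample = " 2 (10.00 am)\n 3 LORD JUSTICE LEVESON: Good morning.\n 4 MR BARR: Good morning, sir. We're going to start\n 5 today with witnesses from the mobile phone companies.";
foreach( explode( "\n", $sample ) as $line ) {
    echo classifyLine( $line ), "\n";
}
```

For the four sample lines this prints `time`, `speaker`, `speaker` and `text`. The order of the checks matters: the most specific patterns have to come first, because the generic "line number followed by text" pattern would match almost everything.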
Example script:
<?php
/*
Proof of Concept : Transcript to XML by Robjong
? :
- action on date change (what to do when the date changes?)
- what to do with lines like "MR MARK HUGHES (sworn)" (make it a note?!)
- what to do with lines like "Questions by MR BARR" (make it a note?!)
- detect events/notes in quotes better? (e.g: MR BLENDIS: (Nods head).)
Notes :
- desperately needs error checking/handling!!!! (for now it just got in the way)
- it might well be that the configuration of PHP will cause file_get_contents to fail,
try curl or download it manually and read the local file
- if you are running PHP < 5.2.4, change the \h in the pattern to \s or [\t ]
*/
# basic usage
// get the transcript as plain text
$txt = file_get_contents( 'http://www.levesoninquiry.org.uk/wp-content/uploads/2012/02/Transcript-of-Morning-Hearing-2-February-2012.txt' );
// convert transcript to XML
$xml = transcriptToXML_beta( $txt );
// we have the transcript as XML, now what?
file_put_contents( 'transcript.xml', $xml ); // let's write it to a file
echo $xml;
function transcriptToXML_beta( $string ) { // beta is just to emphasize lack of thorough testing
$lines = explode( "\n", $string ); // split text into an array of lines
if( !is_array( $lines ) ) { // safety net only: explode() with a non-empty delimiter always returns an array
return false;
}
// these vars will hold the data we need to build our XML
$date = ''; // the date of the transcript
$time = ''; // transcript time
$page = 1; // this will hold the current page number
$linenr = ''; // this will hold the line nr
$speaker = ''; // the name of the speaker
$text = ''; // transcribed text attributed to the speaker
$new = false; // will be true if a new item has been matched
$event = ''; // this will hold notes that are in a quote but need to be stored separately (events)
$xml = ''; // this will be the XML string
$i = 0; // count the lines to display actual line number for debugging
foreach( $lines as $line ) { // loop over lines
$i++;
if( !preg_match( "/[[:graph:]]/", $line ) ) { // line is empty, does not contain printable characters....
continue; // ....so we skip to the next one
}
if( preg_match( "/^\h*\d+\h+(?P<date>[a-z]+,\h+\d+\h+[a-z]+\h\d{4})\s*$/i", $line, $match ) ) { # it looks like a date
$date = $match['date']; // set date
$speaker = ''; // reset vars
$text = '';
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\h+([A-Z]+(?:\s+[A-Z]+){0,4}\h+\(.*?\)|(?i:questions\h+by)[A-Z\h]+)\s*$/", $line, $match ) ) { # (queued) event, uppercase text followed by text between parentheses
$event .= " <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry to queue, to be added after current quote
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*(\d*)\h*\(\h*(?P<time>\d{1,2}[:.]\d{1,2}\h*[ap]m)\)\s*$/i", $line, $match ) ) { # seems we have a time entry
$time = $match['time']; // set time
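if( $new && $speaker ) { // flush the open quote first, otherwise the reset below would silently drop it
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
$xml .= $event; // add queued events, if any
$event = ''; // clear queue
$new = false; // reset
}
$xml .= " <time page=\"{$page}\" line=\"{$match[1]}\">" . strtoupper( str_replace( '.', ':', $match['time'] ) ) . "</time>\n"; // add time as entry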
$speaker = ''; // reset vars
$text = '';
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*(\d+)\s*$/", $line, $match ) ) { # line has just one or more digits, we assume it's a pagenr
if( $match[1] <= $page ) { // if the number is not higher than the current page number, ignore it; this avoids triggering errors for 'empty lines' that only have a line number
continue;
}
$page = (int) $match[1] + 1; // set pagenr, add one because the nr is at the bottom of the page
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\s+\(([[:print:]]+)\)\s*$/", $line, $match ) && !$speaker ) { # note, text is between parentheses
$xml .= " <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\h+([A-Z\h]+\(.*?\))\s*$/", $line, $match ) && !$speaker ) { # note, uppercase text followed by text between parentheses, only if not in quote (capture group added so $match[1] exists)
$xml .= " <event type=\"note\" speaker=\"\" page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
continue;// no need to handle this line any further
} elseif( preg_match("/^\h*(?P<linenr>\d+)\h+(?P<speaker>(?:\h[A-Z]+(?:\h[A-Z]+){0,4}))[:.]\h*(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # new speaker entry
if( $new && $speaker ) { // if we have one open we need to add it first
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
$new = false; // reset
if( $event ) { // if we have a queued note we need to add that too
$xml .= $event; // add entry to XML string
$event = ''; // clear queue
}
}
$speaker = trim( $match['speaker'] ); // assign match to speaker var
$linenr = (int) $match['linenr']; // assign line number
$text = trim( $match['text'] ); // assign text
$new = true; // set new match bool
} elseif( preg_match( "/^\h*(?P<linenr>\d+)\h+(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # follow up text
$text .= ' ' . trim( $match['text'] ); // append text
} else { # unknown line (add a check for linenr-only lines, or remove the $match[1] <= $page check from the pagenr conditional)
// not sure what kind of line this is... add it as a separate 'type'?!
trigger_error( "Unable to parse line {$i} \"{$line}\"" ); // throw exception / trigger error
continue; // no need to handle this line any further
}
if( !$new && $speaker ) {
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
$speaker = ''; // reset vars
$text = '';
$new = false;
if( $event ) { // if we have a queued note we need to add that too
$xml .= $event; // add entry to XML string
$event = ''; // clear queue
}
}
}
// if we have a match open we need to handle it, this might happen because we do not assign the match in the same iteration as we matched it
if( $new ) {
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
$xml .= $event; // also add any events that are still queued
}
if( !trim( $xml ) ) { // no text found so $xml is still an empty string
return false;
}
$date = new DateTime( $date ); // instantiate DateTime with the date from the transcript
$date = $date->format( 'Y-m-d' ); // format date
// now we need to wrap the nodes in a root node
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hearing date=\"{$date}\">\n{$xml}</hearing>\n";
return $xml; // return the XML
}
?>
I will update the comments and the script later today.
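As the notes in the script mention, `file_get_contents` may fail on a URL depending on the PHP configuration. Below is a rough sketch of a fallback, assuming the cURL extension may or may not be installed. The function name `loadTranscript` is hypothetical and not part of the script above, and the error handling is deliberately minimal.

```php
<?php
// Hypothetical helper: read a local copy if there is one, otherwise try cURL,
// otherwise fall back to file_get_contents. Returns the text or false.
function loadTranscript( $source ) {
    if( is_file( $source ) ) { // a manually downloaded copy
        return file_get_contents( $source );
    }
    if( function_exists( 'curl_init' ) ) { // the cURL extension is available
        $ch = curl_init( $source );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ); // return the body instead of printing it
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true ); // follow redirects
        curl_setopt( $ch, CURLOPT_TIMEOUT, 30 ); // give up after 30 seconds
        return curl_exec( $ch ); // false on failure
    }
    return @file_get_contents( $source ); // needs allow_url_fopen for URLs
}
```

Usage would be `$txt = loadTranscript( 'transcript.txt' );` for a local file, or the same call with the URL. Note that this sketch does not check the HTTP status, so an error page would be returned as if it were the transcript.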
Sample output:
<hearing date="2012-02-02">
<time page="1" line="2">10:00 AM</time>
<entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="3">Good morning.</entry>
<entry type="quote" speaker="MR BARR" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</entry>
<entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="8">Very good.</entry>
<entry type="quote" speaker="MR BARR" page="1" line="9">We're going to listen to them all together, sir. Can I ask that the gentlemen are sworn in, please.</entry>
<event page="1" line="9">MR JAMES BLENDIS (affirmed)</event>
<event page="1" line="9">MR ADRIAN GORHAM (sworn)</event>
<event page="1" line="9">MR MARK HUGHES (sworn)</event>
<event page="1" line="9">Questions by MR BARR</event>
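One thing the sample does not show: the script puts the matched text into the XML as-is, so a `&` or `<` in the transcript would produce malformed XML. A minimal sketch of an escaping helper follows. The name `xmlEscape_sketch` is hypothetical, and you would have to call it yourself wherever text or attribute values are added to `$xml`.

```php
<?php
// Hypothetical helper: escape the characters that are special in XML text
// and in double-quoted attribute values.
function xmlEscape_sketch( $value ) {
    // '&' is replaced first, so the '&' of the other entities is not escaped again
    return str_replace(
        array( '&', '<', '>', '"' ),
        array( '&amp;', '&lt;', '&gt;', '&quot;' ),
        $value
    );
}

$text = 'A & B < C';
echo '<entry>' . xmlEscape_sketch( $text ) . "</entry>\n"; // <entry>A &amp; B &lt; C</entry>
```

I used `str_replace` rather than `htmlspecialchars` here because it works on plain bytes and behaves the same on every PHP version; the default flags of `htmlspecialchars` changed in PHP 8.1, so if you prefer it, pass the flags explicitly.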
By the way, just out of curiosity, what do you need this for?