6 Answers
If it were me, I would look for an HTML parser and do it with that.
Another option is to split the string into the <code>.*?</code> sections
and the parts in between, update only the parts in between, and then
join everything back together.
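// Sample input: a backslash in prose, plus a Windows path inside <code>.
$x = "The Hello \\ World document is located in:\n<br>
<code>C:\\documents\\hello_world.txt</code>";
// Split on <code>...</code>; the captured delimiters land at odd indexes.
// The s modifier lets . match newlines inside a code block.
$r = preg_split("/(<code>.*?<\/code>)/s", $x, -1, PREG_SPLIT_DELIM_CAPTURE);
// Even indexes hold the text outside <code>, so escape backslashes only there.
for ($i = 0; $i < count($r); $i += 2) {
    $r[$i] = str_replace("\\", "$\\backslash$", $r[$i]);
}
$x = implode($r);
echo $x;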
Here is the result (as rendered by the browser):
The Hello $\backslash$ World document is located in:
C:\documents\hello_world.txt
Sorry if this approach is not suitable for you.
I reckon I could solve this using negative LookBehinds and/or LookAheads.
You reckon wrong. Regular expressions are not a replacement for a parser.
I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser, and then transform the DOM to your target output format. Is there anything preventing you from taking this route?
Parser FTW, OK. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:
1. Find <code>.*?</code> sections of your file (you probably need to turn on dot-matches-newlines mode).
2. Replace all backslashes inside that section with something unique like #?#?#?#
3. Replace the section found in 1 with that new section.
4. Replace all backslashes with $\backslash$
5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
6. Replace #?#?#?# with \
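The steps above can be sketched in PHP roughly as follows. This is a minimal sketch, not a tested converter: the function name htmlToLatexWithPlaceholder is made up for illustration, and it assumes that <code> tags are never nested and that the placeholder #?#?#?# never occurs in the input.

```php
<?php
// Hypothetical helper; the name and the placeholder token are assumptions.
function htmlToLatexWithPlaceholder(string $html): string
{
    $placeholder = '#?#?#?#'; // must not occur anywhere in the input

    // Find each <code>...</code> section (s = dot matches newlines) and
    // protect the backslashes inside it with the placeholder.
    $out = preg_replace_callback(
        '~<code>.*?</code>~s',
        function (array $m) use ($placeholder): string {
            return str_replace('\\', $placeholder, $m[0]);
        },
        $html
    ) ?? $html; // fall back to the input if PCRE reports an error

    // Every backslash still present is outside <code>, so escape it.
    $out = str_replace('\\', '$\\backslash$', $out);

    // Turn the code tags into a verbatim environment.
    $out = str_replace(
        ['<code>', '</code>'],
        ['\\begin{verbatim}', '\\end{verbatim}'],
        $out
    );

    // Restore the protected backslashes.
    return str_replace($placeholder, '\\', $out);
}

$input = 'The Hello \\ World document is located in:' . "\n"
       . '<code>C:\\documents\\hello_world.txt</code>';
echo htmlToLatexWithPlaceholder($input), "\n";
```

The order matters here: the verbatim markers are inserted only after the prose backslashes have been escaped, so the backslashes in \begin and \end are never touched.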
FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.
Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?
With your "expected input" and the command pandoc -o text.tex test.html
the output is:
The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!
pandoc can read from stdin, write to stdout or pipe right into a file.
Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.
((?:^|</code>)(?:(?!<code>).)+?)\\
 |            |                 |
 |            |                 \-- backslash
 |            \-- least amount of anything not followed by <code>
 \-- start-of-string or </code>
And replace it with:
$1$\backslash$
You'd have to run this regex in "singleline" mode, so . matches newlines. You'd also have to run it multiple times; specifying global replacement is not enough. Each replacement will only replace the first eligible backslash after start-of-string or </code>.
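A rough PHP sketch of running that regex repeatedly is shown below. One adjustment is mine and not part of the answer: because the replacement text $\backslash$ itself contains a backslash, a naive loop would keep matching its own output, so the sketch first substitutes a backslash-free placeholder and swaps in $\backslash$ once the loop has finished. The function name and the placeholder @@BACKSLASH@@ are assumptions for illustration.

```php
<?php
// Hypothetical helper; name and placeholder are assumptions.
function escapeBackslashesOutsideCode(string $html): string
{
    $placeholder = '@@BACKSLASH@@'; // assumed not to occur in the input

    // The regex from the answer, with ~ as delimiter and the s modifier
    // ("singleline" mode) so that . matches newlines. Note that +? needs at
    // least one character between start-of-string or </code> and the backslash.
    $pattern = '~((?:^|</code>)(?:(?!<code>).)+?)\\\\~s';

    do {
        // Each pass replaces only the first eligible backslash per segment.
        $result = preg_replace($pattern, '${1}' . $placeholder, $html, -1, $count);
        if ($result === null) {
            throw new RuntimeException('PCRE error: ' . preg_last_error());
        }
        $html = $result;
    } while ($count > 0);

    // Swap the placeholder for the LaTeX form once nothing is left to match.
    return str_replace($placeholder, '$\\backslash$', $html);
}

echo escapeBackslashesOutsideCode('a \\ b \\ c <code>C:\\x</code> d \\ e'), "\n";
```

Each pass rescans from the start of the string, so the cost grows with the number of backslashes; for large inputs, splitting on <code> sections or using a parser is likely the better choice.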
Write a parser based on an HTML or XML parser like DOMDocument. Traverse the parsed DOM and replace the \ on every text node that is not a descendant of a code node with $\backslash$ and every node that is a code node with \begin{verbatim} … \end{verbatim}.
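A minimal sketch of that traversal with PHP's DOMDocument could look like this. The function names domToLatex and htmlToLatexViaDom are made up for illustration, and the sketch simply drops all other tags (such as <br>) instead of mapping them to LaTeX, which a real converter would need to handle.

```php
<?php
// Hypothetical helper: recursively turns a DOM subtree into LaTeX-ish text.
function domToLatex(DOMNode $node): string
{
    // A code element becomes a verbatim block; its text is left untouched.
    if ($node instanceof DOMElement && $node->nodeName === 'code') {
        return '\\begin{verbatim}' . $node->textContent . '\\end{verbatim}';
    }
    // Any text node reached here is outside <code>, so escape its backslashes.
    if ($node instanceof DOMText) {
        return str_replace('\\', '$\\backslash$', $node->nodeValue);
    }
    // Other nodes: recurse into the children and drop the tag itself.
    $out = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $out .= domToLatex($child);
        }
    }
    return $out;
}

// Hypothetical wrapper: parse an HTML string and convert it.
function htmlToLatexViaDom(string $html): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc->documentElement ? domToLatex($doc->documentElement) : '';
}

echo htmlToLatexViaDom(
    '<p>The Hello \\ World document is located in:<br>'
    . '<code>C:\\documents\\hello_world.txt</code></p>'
), "\n";
```

Because the code element is handled before the recursion reaches its text nodes, nesting and multi-line code blocks need no special treatment, which is the main advantage over the regex-based variants.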