6 Answers

6

If it were me, I would look for an HTML parser and do it with that.

Another option is to split the string into <code>.*?</code> sections and everything else, update only the other parts, and then recombine the pieces.

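// Sample input: one backslash in the prose, several inside a <code> block.
$x = "The Hello \\ World document is located in:\n<br>
<code>C:\\documents\\hello_world.txt</code>";

// PREG_SPLIT_DELIM_CAPTURE keeps the <code>...</code> blocks at the odd indexes,
// so the even indexes hold the text outside them.
// The s modifier lets . match newlines inside a code block.
$r = preg_split('/(<code>.*?<\/code>)/s', $x, -1, PREG_SPLIT_DELIM_CAPTURE);

// Escape backslashes only in the text outside <code>.
for ($i = 0; $i < count($r); $i += 2) {
    $r[$i] = str_replace('\\', '$\\backslash$', $r[$i]);
}

$x = implode('', $r);

echo $x;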

Here is the result:

The Hello $\backslash$ World document is located in: 
C:\documents\hello_world.txt

Sorry if this approach is not suitable for you.

answered 2009-11-23T15:59:55.550
3

I reckon I could solve this using negative LookBehinds and/or LookAheads.

You reckon wrong. Regular expressions are not a replacement for a parser.

I would suggest that you pipe the HTML through htmltidy, then read it with a DOM parser, and then transform the DOM into your target output format. Is there anything preventing you from taking this route?

answered 2009-11-23T15:31:56.047
2

Parser FTW, ok. But if you can't use a parser, and you can be certain that <code> tags are never nested, you could try the following:

  1. Find <code>.*?</code> sections of your file (probably need to turn on dot-matches-newlines mode).
  2. Replace all backslashes inside that section with something unique like #?#?#?#
  3. Replace the section found in 1 with that new section
  4. Replace all backslashes with $\backslash$
  5. Replace all <code> with \begin{verbatim} and all </code> with \end{verbatim}
  6. Replace #?#?#?# with \
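
The six steps above could be sketched in PHP roughly as follows. The function name is hypothetical, and the sketch assumes that <code> tags are never nested and that the placeholder #?#?#?# never occurs in the input.

```php
<?php
// Hypothetical helper implementing the six steps listed above.
// Assumes <code> is never nested and the placeholder is absent from the input.
function htmlToLatexBackslashes(string $html): string
{
    $placeholder = '#?#?#?#';

    // Steps 1-3: protect the backslashes inside each <code>...</code> section.
    // The s modifier makes . match newlines as well.
    $html = preg_replace_callback(
        '~<code>.*?</code>~s',
        function (array $m) use ($placeholder): string {
            return str_replace('\\', $placeholder, $m[0]);
        },
        $html
    );

    // Step 4: escape every backslash that is still present (all outside <code>).
    $html = str_replace('\\', '$\\backslash$', $html);

    // Step 5: swap the tags for a verbatim environment.
    $html = str_replace(
        ['<code>', '</code>'],
        ['\\begin{verbatim}', '\\end{verbatim}'],
        $html
    );

    // Step 6: restore the protected backslashes.
    return str_replace($placeholder, '\\', $html);
}
```

The order matters: the tags are swapped only after step 4, so the backslashes in \begin{verbatim} and \end{verbatim} are not escaped themselves.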

FYI, regexes in PHP don't support variable-length lookbehind. So that makes this conditional matching between two boundaries difficult.

answered 2009-11-23T15:46:44.097
1

Pandoc? Pandoc converts between a bunch of formats. You can also concatenate a bunch of files together and then convert them. Maybe a few shell scripts combined with your PHP scraping scripts?

With your "expected input" and the command pandoc -o text.tex test.html the output is:

The Hello \textbackslash{} World document is located in:
\verb!C:\documents\hello_world.txt!

pandoc can read from stdin, write to stdout or pipe right into a file.
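
Since pandoc reads stdin and writes stdout, a PHP scraping script could pipe its HTML through it. Below is a minimal sketch with a hypothetical helper, pipeThrough(), that feeds a string to any command and returns what the command prints. It assumes PHP 7.4 or later (for the array form of proc_open) and input small enough not to fill the pipe buffers.

```php
<?php
// Hypothetical helper: send $input to a command's stdin and return its stdout.
// Passing the command as an array requires PHP 7.4 or later.
function pipeThrough(array $command, string $input): string
{
    $spec = [
        0 => ['pipe', 'r'], // the child's stdin
        1 => ['pipe', 'w'], // the child's stdout
    ];
    $process = proc_open($command, $spec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start ' . $command[0]);
    }

    // Writing everything before reading is fine for small documents only.
    fwrite($pipes[0], $input);
    fclose($pipes[0]);

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $exitCode = proc_close($process);
    if ($exitCode !== 0) {
        throw new RuntimeException("Command exited with status $exitCode");
    }
    return $output;
}
```

With pandoc installed and on the PATH, a call might look like pipeThrough(['pandoc', '-f', 'html', '-t', 'latex'], $html), where $html holds the scraped markup.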

answered 2009-11-23T17:05:23.097
0

Provided that your <code> blocks are not nested, this regex would find a backslash after ^ start-of-string or </code> with no <code> in between.

((?:^|</code>)(?:(?!<code>).)*?)\\
    |            |              |
    |            |              \-- backslash
    |            \-- least amount of anything not followed by <code>
    \-- start-of-string or </code>

And replace it with:

$1$\backslash$

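You'd have to run this regex in "singleline" mode, so that . matches newlines. You'd also have to run it multiple times; specifying global replacement is not enough, because each pass replaces only the first eligible backslash after start-of-string or </code>. Since $\backslash$ itself contains a backslash, substitute a backslash-free placeholder during the passes and swap in $\backslash$ at the end, otherwise every pass finds a new match.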
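
A repeated-pass loop of this kind could be sketched in PHP as follows. The function name and the placeholder byte are assumptions of the sketch, not part of the answer. The quantifier used here is *?, which also catches a backslash sitting directly at the start of the string or directly after </code>.

```php
<?php
// Hypothetical helper: escape backslashes outside <code> blocks by running
// the regex repeatedly until a pass makes no replacement.
function escapeOutsideCode(string $html): string
{
    // s modifier: . also matches newlines ("singleline" mode).
    $pattern = '~((?:^|</code>)(?:(?!<code>).)*?)\\\\~s';

    // Assumed never to occur in the input. It keeps a later pass from
    // matching the backslash inside an already inserted $\backslash$.
    $placeholder = "\x01";

    do {
        // Each pass handles one backslash per segment outside <code>.
        $html = preg_replace($pattern, '$1' . $placeholder, $html, -1, $count);
    } while ($count > 0);

    return str_replace($placeholder, '$\\backslash$', $html);
}
```

The number of passes equals the largest number of backslashes in any one segment outside <code>, plus a final pass that finds nothing.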

answered 2009-11-23T15:55:36.700
0

Write a converter on top of an HTML or XML parser like DOMDocument. Traverse the parsed DOM, replace the \ in every text node that is not a descendant of a code node with $\backslash$, and wrap the contents of every code node in \begin{verbatim} … \end{verbatim}.
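
A minimal sketch of such a traversal is shown below. The function name nodeToLatex() is hypothetical, wrapping the fragment in a <div> is an assumption made so that it has a single root to start from, and the sample input is ASCII only (loadHTML needs an explicit encoding hint for anything else).

```php
<?php
// Hypothetical helper: render a DOM node as LaTeX-flavoured text.
function nodeToLatex(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Text outside <code>: escape the backslashes.
        return str_replace('\\', '$\\backslash$', $node->nodeValue);
    }
    if ($node instanceof DOMElement && strtolower($node->nodeName) === 'code') {
        // textContent is the raw text, so its backslashes are left alone.
        return "\\begin{verbatim}\n" . $node->textContent . "\n\\end{verbatim}";
    }
    // Any other node: concatenate whatever its children produce.
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= nodeToLatex($child);
    }
    return $out;
}

$html = "The Hello \\ World document is located in:\n<br>\n"
      . "<code>C:\\documents\\hello_world.txt</code>";

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>');
$root = $doc->getElementsByTagName('div')->item(0);

echo nodeToLatex($root);
```

Because the code branch returns before recursing, text nodes inside <code> never reach the escaping branch, which is what keeps the Windows path intact.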

answered 2009-11-23T15:57:12.717