
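I have been trying to use pypandoc to convert the HTML string question_text_html in the code below (a math question written in HTML) into a LaTeX string. However, the converted string keeps including irrelevant fragments such as "\protect\hypertarget{MJX-...}.....".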

import pypandoc
from selenium import webdriver

# The original snippet never created the driver; Firefox is assumed here.
driver = webdriver.Firefox()
driver.get("https://nigerianscholars.com/past-questions/mathematics/?show_answers=yes")
question_blocks = driver.find_elements_by_class_name('question_block')
for question_block in question_blocks:
    question_text = question_block.find_element_by_class_name('question_text')
    # innerHTML here is the markup *after* MathJax has rendered the page
    question_text_html = question_text.get_attribute('innerHTML')
    question_latex = pypandoc.convert_text(question_text_html, 'tex', format='html')
    print(f'Question Html is {question_text_html}')
    print(f'Question latex is {question_latex}')
 

It typically gives:

 Question Html is <html><body><p class="q_question">Differentiate <span class="MathJax_Preview" style="color: inherit;"></span><span class="mjx-chtml MathJax_CHTML" data-mathml='&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mo&gt;&amp;#x2212;&lt;/mo&gt;&lt;mn&gt;4&lt;/mn&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/math&gt;' id="MathJax-Element-1-Frame" role="presentation" style="font-size: 114%; position: relative;" tabindex="0"><span aria-hidden="true" class="mjx-math" id="MJXc-Node-1"><span class="mjx-mrow" id="MJXc-Node-2"><span class="mjx-mo" id="MJXc-Node-3"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mn" id="MJXc-Node-4"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span><span class="mjx-mi" id="MJXc-Node-5"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-6"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">+</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-7"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">5</span></span><span class="mjx-msubsup" id="MJXc-Node-8"><span class="mjx-base"><span class="mjx-mo" id="MJXc-Node-9"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span><span class="mjx-sup" style="font-size: 70.7%; vertical-align: 0.513em; padding-left: 0px; padding-right: 0.071em;"><span class="mjx-mn" id="MJXc-Node-10" style=""><span class="mjx-char 
MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">2</span></span></span></span><span class="mjx-mo" id="MJXc-Node-11"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">(</span></span><span class="mjx-mi" id="MJXc-Node-12"><span class="mjx-char MJXc-TeX-math-I" style="padding-top: 0.221em; padding-bottom: 0.309em;">x</span></span><span class="mjx-mo MJXc-space2" id="MJXc-Node-13"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.309em; padding-bottom: 0.441em;">−</span></span><span class="mjx-mn MJXc-space2" id="MJXc-Node-14"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.397em; padding-bottom: 0.353em;">4</span></span><span class="mjx-mo" id="MJXc-Node-15"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.485em; padding-bottom: 0.572em;">)</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mo stretchy="false">(</mo><mn>2</mn><mi>x</mi><mo>+</mo><mn>5</mn><msup><mo stretchy="false">)</mo><mn>2</mn></msup><mo stretchy="false">(</mo><mi>x</mi><mo>−</mo><mn>4</mn><mo stretchy="false">)</mo></math></span></span><script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script> with respect to x.</p></body></html>






Question latex is Differentiate
{}\protect\hypertarget{MathJax-Element-1-Frame}{}{\protect\hypertarget{MJXc-Node-1}{}{\protect\hypertarget{MJXc-Node-2}{}{\protect\hypertarget{MJXc-Node-3}{}{{(}}\protect\hypertarget{MJXc-Node-4}{}{{2}}\protect\hypertarget{MJXc-Node-5}{}{{x}}\protect\hypertarget{MJXc-Node-6}{}{{+}}\protect\hypertarget{MJXc-Node-7}{}{{5}}\protect\hypertarget{MJXc-Node-8}{}{{\protect\hypertarget{MJXc-Node-9}{}{{)}}}{\protect\hypertarget{MJXc-Node-10}{}{{2}}}}\protect\hypertarget{MJXc-Node-11}{}{{(}}\protect\hypertarget{MJXc-Node-12}{}{{x}}\protect\hypertarget{MJXc-Node-13}{}{{−}}\protect\hypertarget{MJXc-Node-14}{}{{4}}\protect\hypertarget{MJXc-Node-15}{}{{)}}}}{\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.

How can I remove every "\protect\hypertarget{MJXc-Node-10}" from the LaTeX, leaving only

Differentiate {\((2x + 5)^{2}(x - 4)\)}}\((2x+5)^2(x-4)\)
with respect to x.

1 Answer


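With MathJax, the equations are originally present in TeX notation. The spans are created by the MathJax JavaScript to lay out the formulas in HTML. At the moment, you let MathJax render the equation first, scrape the rendered equation, and then try to convert it back into the original TeX equation. It would be more direct to read the TeX equation itself, without any JavaScript rendering.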

To do that, you only need to disable JavaScript in Selenium. With the Firefox driver, for example, this should do the trick:

from selenium.webdriver.firefox.options import Options
from selenium import webdriver

opts = Options()
# Turn off JavaScript so MathJax never rewrites the page into spans
opts.preferences.update({
    "javascript.enabled": False,
})
driver = webdriver.Firefox(options=opts)

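Alternatively, if for some reason you need to work with the rendered version with JavaScript enabled, you could try to grab the <script type="math/tex"> element inside the <p>. It contains the complete equation, but without the TeX math delimiters: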

<p class="q_question">...<script type="math/tex">(2x+5)^2(x-4)</script>...</p>

That way you do not have to remove the spans. You then need to wrap it in \(...\) TeX math delimiters for the PDF.
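As a rough illustration of that second approach, here is a minimal sketch that uses only the standard library. It drops the rendered MathJax markup, keeps the surrounding text, and wraps the content of each math/tex script in \(...\) (or \[...\] for display mode). The names TexRecovery and recover_tex are made up for this example, and the class list is an assumption based on the markup shown in the question; other MathJax versions or output modes may use different classes.

```python
from html.parser import HTMLParser

# Assumption: classes MathJax puts on rendered markup, taken from the question's HTML.
RENDERED_CLASSES = {"MathJax_Preview", "MathJax_CHTML", "MJX_Assistive_MathML"}
# Tags without a closing tag; they must not affect the nesting counter.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TexRecovery(HTMLParser):
    """Hypothetical helper: collect text, replacing rendered math with its TeX source."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []             # output fragments, joined at the end
        self.skip_depth = 0         # > 0 while inside rendered MathJax markup
        self.tex = None             # list of TeX chunks while inside a math/tex script
        self.delims = ("\\(", "\\)")
        self.in_other_script = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1    # nested element inside rendered markup
            return
        attr = dict(attrs)
        if RENDERED_CLASSES & set((attr.get("class") or "").split()):
            self.skip_depth = 1     # start skipping this element and its children
        elif tag == "script":
            script_type = attr.get("type") or ""
            if script_type.startswith("math/tex"):
                display = "mode=display" in script_type
                self.delims = ("\\[", "\\]") if display else ("\\(", "\\)")
                self.tex = []
            else:
                self.in_other_script = True  # ordinary JavaScript: ignore its body

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag == "script":
            if self.tex is not None:
                source = "".join(self.tex).strip()
                self.parts.append(self.delims[0] + source + self.delims[1])
            self.tex = None
            self.in_other_script = False

    def handle_data(self, data):
        if self.skip_depth or self.in_other_script:
            return
        if self.tex is not None:
            self.tex.append(data)
        else:
            self.parts.append(data)


def recover_tex(html):
    """Return the text of `html` with each formula given as delimited TeX."""
    parser = TexRecovery()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened version of the markup from the question.
sample = (
    '<p class="q_question">Differentiate '
    '<span class="MathJax_Preview"></span>'
    '<span class="mjx-chtml MathJax_CHTML"><span class="mjx-math" id="MJXc-Node-1">'
    '<span class="mjx-mn" id="MJXc-Node-4">2</span></span>'
    '<span class="MJX_Assistive_MathML"><math><mn>2</mn></math></span></span>'
    '<script id="MathJax-Element-1" type="math/tex">(2x+5)^2(x-4)</script>'
    ' with respect to x.</p>'
)
print(recover_tex(sample))
```

In the scraping loop this would be called on question_text_html in place of the pypandoc conversion. Note that the sketch does not escape LaTeX special characters (%, &, _ and so on) in the text around the formulas, so that step would still be needed for arbitrary questions.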

Answered 2021-01-10T20:34:52.777