I am trying to scrape data from this web page:

https://www.biharjobportal.com/bihar-police-constable-bharti/

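I managed to remove all the Google ads from the page with the code below. That was easy because they share a class name:

// getElementsByClassName returns a live collection, so iterate backwards while removing
var theaders = document.getElementsByClassName('adsbygoogle');
for (var i = theaders.length - 1; i >= 0; i--)
{
    theaders[i].parentElement.removeChild(theaders[i]);
}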

However, the page also contains elements that have no IDs, class names, or anything similar (see the screenshot):

[screenshot]

All I know is that the elements to remove sit between these HTML comments:

    <!-- WP QUADS Content Ad Plugin v. 2.0.17  -->

    **codes to remove (as in the picture)**

    <!-- WP QUADS Content Ad Plugin v. 2.0.17  -->

I tried to remove all such items with XPath, but nothing happened. This is the code I wrote:

var badTableEval = document.evaluate(
    "/html/body/div[1]/div/div[1]/main/article/div/div/ul[3]",
    document.documentElement,
    null,
    XPathResult.FIRST_ORDERED_NODE_TYPE,
    null
);

if (badTableEval && badTableEval.singleNodeValue) {
    var badTable = badTableEval.singleNodeValue;
    badTable.parentNode.removeChild(badTable);
}

How can I remove all of these elements from the web page? https://www.biharjobportal.com/bihar-police-constable-bharti/

1 Answer

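You can detect comments in the document this way (see the snippet). Now it is up to you to devise a clever function that removes the elements between the comments. OK, you asked for it: the snippet now includes a way to remove the elements between two equal comments.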

const root = document.querySelector("body");
const allEls = [...root.childNodes];
const IS_COMMENT = 8; // same value as Node.COMMENT_NODE
const usedAsClosing = new Set(); // comments that already closed a pair

allEls.forEach((el, i) => {
  // skip closing comments, otherwise they would pair up with the next opening one
  if (el.nodeType === IS_COMMENT && !usedAsClosing.has(el)) {
    // we have a comment. Find the (index of) next equal comment in [allEls]
    // from this point on
    const subset = allEls.slice(i + 1);
    const hasEqualNextComment = subset
      .findIndex(elss =>
        elss.nodeType === IS_COMMENT &&
        elss.textContent.trim() === el.textContent.trim());

    // if an equal comment has been found, remove every node between
    // the two comments (subset[0] is the first node after the opening one)
    if (hasEqualNextComment > -1) {
      usedAsClosing.add(subset[hasEqualNextComment]);
      subset.slice(0, hasEqualNextComment)
        .forEach(elss =>
          elss.parentNode && elss.parentNode.removeChild(elss));
    }
  }
});

body {
  font: normal 12px/15px verdana, arial;
  margin: 2rem;
}

<!-- WP QUADS Content Ad Plugin v. 2.0.17  -->
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>
<!-- WP QUADS Content Ad Plugin v. 2.0.17  -->

<!-- other comment -->
<ul>
  <li>item 4</li>
  <li>item 5</li>
  <li>item 6</li>
</ul>
<!-- other comment: the above is kept -->

<!-- something 2 remove -->
<div>item 7</div>
<!--something 2 remove-->
<div>item 8</div>

<p>
  <b>The result should show item 4 - item 6, item 8 and the 
    text within this paragraph</b>.
  <br><i>Note</i>: this will only work for top-level comments
  within the given [root] (so, not for comments that are nested
  within elements).
  <br>Also, you may have to strip line endings from multi-line
  comments before comparing them.
</p>
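
The two limitations in the note above (top-level comments only, and multi-line comments) could be handled roughly as follows. This is only a sketch, not part of the original snippet: `normalize` and `removeBetweenMarkers` are hypothetical helper names, and it assumes the marker comments always come in sibling pairs under the same parent, as in the question.

```javascript
const COMMENT_NODE = 8; // same value as Node.COMMENT_NODE in browsers

// Collapse all whitespace (including line endings) so that multi-line
// comments compare equal to their single-line form.
function normalize(text) {
  return String(text).replace(/\s+/g, " ").trim();
}

// Hypothetical helper: walks `root` recursively and removes every node that
// sits between two sibling comments whose normalized text equals `marker`.
// The comments themselves are kept. Returns the number of removed nodes.
function removeBetweenMarkers(root, marker) {
  const wanted = normalize(marker);
  let removed = 0;
  let pending = null; // nodes seen since an opening marker; null outside a pair

  // Iterate over a copy: removing nodes from the live childNodes list
  // while looping over it would skip siblings.
  for (const node of Array.from(root.childNodes)) {
    const isMarker =
      node.nodeType === COMMENT_NODE &&
      normalize(node.textContent) === wanted;

    if (isMarker && pending === null) {
      pending = []; // opening marker
    } else if (isMarker) {
      // closing marker: only now is it safe to delete what was collected
      for (const victim of pending) root.removeChild(victim);
      removed += pending.length;
      pending = null;
    } else if (pending !== null) {
      pending.push(node);
    } else if (node.childNodes && node.childNodes.length > 0) {
      // outside a pair: look for nested marker pairs inside this node
      removed += removeBetweenMarkers(node, marker);
    }
  }
  // An opening marker without a closing one leaves its siblings untouched.
  return removed;
}
```

In a browser this would be called as `removeBetweenMarkers(document.body, "WP QUADS Content Ad Plugin v. 2.0.17")`. Deleting only once the closing comment has been seen means a stray, unpaired marker cannot wipe out the rest of its parent.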

Answered 2020-11-24T12:58:22.983