
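I am trying to scrape a web page using Jsoup. Jsoup does not seem to capture the <input> elements the way Chrome does.

It is missing values such as these:

<input type="hidden" id="fileId" value="3168935269">
<input type="hidden" id="secondsLeft" value="20">

Using Jsoup, I extracted these elements: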

<input type="hidden" class="jsItemDirId" value="yRg1N-QP" />

<input type="hidden" class="jsItemFileId" value="i-EbooI0" />

<input type="hidden" id="fbAppId" value="255519317820035" />

<input type="hidden" id="sPrefix" value="http://search.4shared.com" />

<input type="hidden" class="sLink file" value="/q/CCAD/1" />

<input type="hidden" class="sLink video" value="/q/CCQD/1/video" />

<input type="hidden" class="sLink music" value="/q/CCQD/1/music" />

<input type="hidden" class="sLink photo" value="/q/CCQD/1/photo" />

<input type="hidden" class="sLink games" value="/q/CCQD/1/game" />

<input type="hidden" class="sLink book" value="/q/CCQD/1/books_office" />

<input type="hidden" class="sLink featured_videos" value="/q/CCQD/1/video" />

<input type="hidden" id="sBreadcrumbsPhrase" value="Searching" />

<input type="text" id="searchQuery" placeholder="Search files" />

<input type="hidden" id="interval" value="600000" />

<input type="hidden" id="archiveReadyDownload" value="Your file is ready for download:" />

<input type="hidden" id="defAvatar" value="http://static.4shared.com/images/user2.png?ver=2906097813" />

<input type="hidden" id="zipAvatar" value="http://static.4shared.com/icons/32x32/zip.png?ver=655479399" />

<input type="hidden" id="b1Avatar" value="http://static.4shared.com/icons/32x32/b1.png?ver=703417425" />

<input type="hidden" id="torrentAvatar" value="http://static.4shared.com/icons/32x32/torrent.png?ver=1628575404" />

<input type="hidden" id="contactRequestText" value="Your friend $[p1] just joined 4shared." />

<input type="button" value="Ok" onclick="checkAndStartDownload(event);" style="width:80px" />

<input type="button" value="Cancel" onclick="hideTermsOfUse();" />

<input type="hidden" id="startTitle" value="Share" />

<input type="hidden" id="sharingFolderTitle" value="Share folder" />

<input type="hidden" id="sharingFileTitle" value="Share file" />

<input type="hidden" id="placeHolderEnterEmailAdresses" value="Enter names or e-mail addresses" />

<input type="hidden" id="dLinkPay" value="Direct link is available only for Premium Users.&lt;br&gt; Sign Up to premium account to get all 4shared Premium Features." />

<input type="hidden" id="premiumRequired" value="Premium account required!" />

<input type="hidden" id="hosted" value="Hosted at" />

<input type="hidden" id="fbInviteFolderTitle" value="I've shared a folder with you on 4shared. Find out what it is!" />

<input type="hidden" id="fbInviteFileTitle" value="I've shared a file with you on 4shared. Find out what it is!" />

<input type="hidden" id="contacts" value="Contacts" />

<input type="hidden" id="fb_share_folder_img" value="http://static.4shared.com/images/facebook/share_folder.png?ver=2422162001" />

<input type="hidden" id="fb_share_file_img" value="http://static.4shared.com/images/facebook/share_file.png?ver=1565381062" />

<input type="hidden" id="fb_redir_param" value="https://www.4shared.com/servlet/signin/facebook?fp=https://www.4shared.com/account/home.jsp" />

<input type="hidden" id="fileSuccessfullSent" value="Your file was successfully sent" />

<input type="hidden" id="folderSuccessfullSent" value="Your folder was successfully sent" />

<input type="hidden" id="fbRequestSharedText" value="I'd like to share $[p0] with you" />

<input type="hidden" id="fbSharingOff" value="null" />

<input type="hidden" id="fbInviteText" value="4shared.com - free web-based file sharing and storage." />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input class="lucida dark-gray selectable" id="simpleViewLink" type="text" readonly="readonly" />

<input type="text" id="emails" class="lucida f12 dark-gray tags gaClick" data-element="shF-2-1" name="emails" tabindex="3" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="text" id="downloadFileLink" class="lucida f12 selectable" name="" tabindex="3" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" value="" id="premiumDirectLink" />

<input type="text" class="lucida f12 selectable" id="fileHTMLembed" name="" tabindex="3" />

<input type="text" id="fileForumEmbed" class="lucida f12 selectable" name="" tabindex="4" />

<input type="text" class="lucida f12 selectable" id="fileEmbed" tabindex="5" />

<input class="lucida f12 dark-gray selectable" id="searchFriendsInput" type="text" placeholder="Search by name or e-mail address" />

<input id="tags_2" type="text" class="tags" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="4" value="" id="subdomainInput" />

<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="3" value="" id="subdomainValue" readonly="true" />

<input type="hidden" id="allreadyPasswordProtectedMess" value="You can't set password for this folder, because the parent folder '$[1]' is password protected." />

<input type="hidden" id="passwordChangeConfirmTitle" value="Password Change" />

<input type="hidden" id="passwordChangeConfirmBody" value="Some child directory already password protected. &lt;br/&gt; Changing password of current directory will cause password overwrite on children's " />

<input type="hidden" id="confirmButtonMsg" value="Change" />

<input type="hidden" id="cancelButtonMsg" value="Cancel" />

<input type="text" class="passInput lucida f12" name="" tabindex="4" value="" id="passwordInput" />

<input type="password" class="passInput lucida f12" name="" tabindex="4" value="" id="changePasswordInput" readonly="true" />

<input type="hidden" id="previewLinkForEmbed" />

<input type="hidden" id="previewLinkForWidget" />

<input class="lucida f12 dark-gray" id="widget_width" type="text" style="width:30px;" />

<input class="lucida f12 dark-gray" id="widget_height" type="text" style="width:30px;" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="3" id="htmlEmbed" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" id="forumEmbed" />

<input type="text" value="http://www.4shared.com/android/i-EbooI0/batman_hd.html" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="16" dir="ltr" />

<input type="text" value="&lt;a href=&quot;http://www.4shared.com/android/i-EbooI0/batman_hd.html&quot; target=_blank&gt;batman hd.apk&lt;/a&gt;" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="17" dir="ltr" />

<input type="text" value="[URL=http://www.4shared.com/android/i-EbooI0/batman_hd.html]batman hd.apk[/URL]" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="18" dir="ltr" />

<input type="hidden" name="showComments" value="true" />

<input type="hidden" name="showPart" value="commentList" />

<input type="hidden" name="replyId" value="" />

<input type="hidden" id="norecaptcha" name="norecaptcha" value="" />

<input type="hidden" name="start" value="0" />

<input id="submitCommBtn" type="submit" value="Add New Comment" class="gaClick floatLeft f11 marginT10 round4 lucida no-line sendCommentButton" data-element="32" />

<input type="text" class="input-gray-big wide round4" id="recaptcha_response_field" name="recaptcha_response_field" style="width:250px" />

<input class="field2" id="submitCommBtn" type="submit" value="Confirm" />

<input type="text" name="fileName" value="4shared" class="xBox" />

<input type="hidden" name="newValue" value="" />

<input type="hidden" name="mode" value="" />

<input type="hidden" name="fid" value="3168935269" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="text" name="newValue" class="xBox" style="width:200px" />

<input type="hidden" name="mode" value="2" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12" onclick="quickEditCancel(1)" />

<input type="text" name="newValue" class="xBox" style="width:330px" onkeypress="return quickEditIsValidCharForFileName(event);" />

<input type="hidden" name="mode" value="10" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatRight marginL10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatRight" onclick="quickEditCancel();" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="did" value="0" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="text" name="searchName" style="width:250px;padding:1px 0" class="ajax-suggestion field gaClick" data-element="fs1" autocomplete="off" />

<input type="submit" name="submitButton" value="Search" class="button gaClick" data-element="fs3" />

<input type="hidden" name="searchmode" value="2" />

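Using try.jsoup.com does not produce the input elements that Chrome shows either, which suggests the problem lies with Jsoup rather than with my code.

From reading other threads, it seems that JavaScript may change the HTML after the page has loaded. None of them offered a workable answer on how to deal with this.

What am I doing wrong, and how can I fix it?

Here is my code for fetching the full HTML page: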

Document doc = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html").timeout(0).get();
System.out.println(doc.toString() + "\n\n\n\n");
Elements links = doc.select("input[type=hidden]");
for (org.jsoup.nodes.Element link : links) {
    System.out.println(link);
}

See a screenshot of the required values here.


Solution

Connection.Response response = Jsoup.connect("myUrl")
    .method(Connection.Method.GET)
    .execute();

Document homePage = Jsoup.connect("myUrl")
    .cookies(response.cookies())
    .get();

This is a modified version of the code described here: Jsoup Cookies for HTTPS scraping. It fetches the cookies, as Niranjan suggested, and then reconnects to your URL with them.


How about something simpler?

def isIn(x, y):
    return x in y or y in x

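If we are dealing with whole strings and want to know whether one is part of the other, there is no need to iterate over their characters; doing that would only tell you whether the two strings have some characters in common.

Now, if you really do need to know whether the two strings share some characters, this will do nicely: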

def isIn(x, y):
    return any(c in y for c in x)

1 Answer

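Jsoup cleans up your HTML while parsing it, and it can handle your HTML even when it is malformed. Try dumping the HTML after parsing, i.e. Document.html(), and check whether the elements you are missing meet the criteria of your select clause.

Update

Here you go, try this. If it works, I will explain it to you!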

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; it only wraps main so the snippet compiles.
public class HiddenFieldScraper
{
    public static void main(String[] args) throws IOException
    {
        // Cookies copied from a browser session that had already loaded the page.
        Map<String, String> cookieMap = new HashMap<String, String>();
        cookieMap.put("day1host", "h");
        cookieMap.put("d1.loginity.mark", "1");
        cookieMap.put("hostid", "-1314014314");
        cookieMap.put("__qca", "P0-2042580316-1371938383086");
        cookieMap.put("cd1v", "OOhB");
        cookieMap.put("c29", "1");
        cookieMap.put("__utma", "210074320.280144312.1371938377.1371938377.1371938377.1");
        cookieMap.put("__utmb", "210074320.4.10.1371938377");
        cookieMap.put("__utmc", "210074320");
        cookieMap.put("__utmz", "210074320.1371938377.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)");

        // Request the page as a browser would, sending the cookies along.
        Document document = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html")
            .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
            .followRedirects(true)
            .cookies(cookieMap)
            .get();
        //System.out.println(document.html());

        // Print every hidden input found in the parsed document.
        Elements elements = document.select("input[type=hidden]");
        for (Element element : elements)
        {
            System.out.println(element);
        }
    }
}

Explanation

I am not sure whether the following pattern holds for every URL you try.

This is how the site responds.

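  1. The site redirects from /get/i-EbooI0/batman_hd.html to android/i-EbooI0/batman_hd.html. Along with the redirect, it sends 2 cookies in response to the first request.

    First request

  2. The second request carries a few cookies.

    Second request

    There are still no hidden fields in <body> at this point. Looking at the Elements tab confirms this.

  3. Now request http://www.4shared.com/get/i-EbooI0/batman_hd.html in the browser.

    Third request

    You now have the required hidden fields in <body>.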

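The code performs Step 3 directly.


Conclusion:

If you observe the same behaviour for other URLs as well, then you have to write code that captures the cookies from a Response and passes them along in each subsequent Request until you get the required hidden fields.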
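
The capture-and-resend idea from the conclusion can be sketched with the JDK alone. This is only an illustration under stated assumptions: CookieCarrySketch and its methods are hypothetical names, not part of Jsoup; the Set-Cookie values in main are made up from cookie names in the answer above (which response actually sets which cookie is not known); and cookie attributes such as Path, Domain and expiry are parsed but deliberately ignored.

```java
import java.net.HttpCookie;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: accumulates cookies across several responses. */
final class CookieCarrySketch {
    // name -> value; a later response overwrites an earlier value for the same name
    private final Map<String, String> cookies = new LinkedHashMap<>();

    /** Records every cookie found in the given Set-Cookie header values. */
    void absorb(List<String> setCookieHeaders) {
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
    }

    /** Read-only view of the collected cookies. */
    Map<String, String> asMap() {
        return Collections.unmodifiableMap(cookies);
    }

    /** Formats the collected cookies as the value of a Cookie request header. */
    String toCookieHeader() {
        StringJoiner joiner = new StringJoiner("; ");
        cookies.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }

    public static void main(String[] args) {
        CookieCarrySketch jar = new CookieCarrySketch();
        // Assumed headers from a first response (illustrative values only).
        jar.absorb(List.of("hostid=-1314014314; Path=/", "day1host=h; Path=/"));
        // Assumed header from a second response.
        jar.absorb(List.of("cd1v=OOhB; Path=/"));
        // prints: hostid=-1314014314; day1host=h; cd1v=OOhB
        System.out.println(jar.toCookieHeader());
    }
}
```

With Jsoup the parsing step is not needed, since Connection.Response.cookies() already returns a Map<String, String>. The point of the sketch is the accumulation: merge the map from each response into one map, and pass that map to .cookies(...) on the next request.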

answered 2013-06-22T18:09:47.783