Okay, this is my third question on the same topic. I am working on a project in which I will analyze the usage of certain word types and word contexts in Bengali. Since the corpus I need is not readily available, I am crawling and scraping multiple sites by topic.
I have written crawling code, and so far it has worked almost perfectly on more than 20-25 different Bengali websites/blogs/news portals. But to get data covering literary and story/novel-style language, I need to scrape a few more such sites. I cannot figure out exactly why the code fails to collect all the "internal" links from the five sites I list below. By internal I mean pages that belong to the site's own domain, i.e. within the site, as opposed to external links.
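To make that concrete, here is a minimal sketch of the check I mean (the helper name is_internal is only for illustration and is not part of my crawler): a link counts as internal when, after resolving it against the page URL, it stays on the same domain.
from urllib.parse import urlparse, urljoin

def is_internal(page_url, href):
    # Illustrative only: resolve a possibly relative href against the page URL
    # and compare the two domains.
    full = urljoin(page_url, href)
    return urlparse(full).netloc == urlparse(page_url).netloc

# is_internal("https://example.com/story/1", "/story/2")          -> True
# is_internal("https://example.com/story/1", "https://other.com") -> False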
Libraries used:
import requests
import urllib.parse
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import time
import random
My crawling code (I put a lot of work into handling errors and not losing data; comments are included):
def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links. Iterative function.
    You'll find all links in the `internal_urls` global set variable.
    params:
        max_urls (int): maximum number of URLs to crawl, default is 30.
    """
    global total_urls_visited  # Global crawled-URL counter, initialized to zero
    global stack               # Since it's an iterative function, I maintain a stack
    global crawled             # Keeps track of all the links already crawled
    global stackLen            # Length of the stack, initialized to zero
    session = requests.Session()  # I used a Session because multiple links are being
                                  # requested; this somewhat increased the speed
    # Certain websites would not allow direct access without a header.
    # This one seemed to work fine with all of them.
    user_agent = "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"
    urlCur = url   # Stores the current URL
    lastCount = 0  # Error management: somehow some useless links kept returning
                   # links like ".../%252525..." in a loop,
                   # and I don't know how to avoid that.
                   # So I did this: if the last link is crawled too many times, just return
    while True:
        sesCheck = 0  # Counts the number of times the Session will be recreated
        print("\nCrawled Count: \t\t", total_urls_visited)
        # Just prints the number of sites already crawled;
        # it helps to keep track of what's going on, since this takes time
        total_urls_visited += 1
        if stackLen == 1:  # Here, that issue of repeated useless links
            lastCount += 1
        req = ''
        # Multiple network-failure checks; the connection can be bad here :(
        # I have lost too much data to this
        checkError1 = 0
        checkError2 = 0
        checkError3 = 0
        checkError4 = 0
        while req == '':
            try:
                if (stackLen > 0) and (checkError4 > 15):
                    print("\n\nRepeated Error, endlessly. Abort.\n\n")
                    checkError4 = 0
                    return 0  # Value that signals the process couldn't be completed fully
                if (stackLen > 0) and (checkError3 > 10):
                    print("\n\nRepeated Error, sleep two minutes. Then skip.\n\n")
                    time.sleep(120)
                    # Certain sites won't let me request any more midway,
                    # something about a max limit
                    random.shuffle(stack)
                    # I always shuffle the stack before popping,
                    # so there's no bias towards which link shows up next.
                    # Also, no intentional starvation
                    urlCur = stack.pop()
                    session = requests.Session()
                    # I don't know if I should do this, but to avoid overloading the server
                    # (I don't know if that's how it works)
                    # I request a new Session
                    stackLen -= 1
                    checkError3 = 0
                if (stackLen > 0) and (checkError2 > 8):
                    print("\n\nRepeated Error, sleep for a minute. Then skip.\n\n")
                    time.sleep(60)  # Same reason
                    random.shuffle(stack)
                    urlCur = stack.pop()
                    session = requests.Session()  # Should I be doing this?
                    stackLen -= 1
                    checkError2 = 0
                if (stackLen > 0) and (checkError1 > 3):
                    print("\n\nRepeated Error. Sleep for a minute.\n\n")
                    time.sleep(60)  # It just gives me enough time to sometimes
                                    # switch my modem off and on,
                                    # to fix connectivity issues
                    session = requests.Session()
                    checkError1 = 0
                startTime = time.time()  # Checks how long the request takes
                req = session.get(urlCur, headers={'User-Agent': user_agent})
                endTime = time.time()
                # If it took more than a minute and this isn't happening often,
                # request a new session (is that required? I'm dealing with too many links)
                if (endTime - startTime > 60) and (sesCheck <= 2):
                    session = requests.Session()
                    sesCheck += 1
            except Exception as e:
                print(e)
                checkError1 += 1
                checkError2 += 1
                checkError3 += 1
                checkError4 += 1
                print("Connection refused :(")
                print("5 seconds break")
                time.sleep(5)
                print("Let's go")
                continue
        if stackLen < 5:
            print(f"[*] Crawling: {urlCur}")  # Prints the last few links, in case some error is about to show up
        links = get_all_website_links(urlCur, req)  # Gets all internal links
        for link in links:
            if link not in (stack + crawled):  # If the link has been crawled or is already in the stack, ignore it
                stack.append(link)
                stackLen += 1
        if (total_urls_visited > max_urls) or (stackLen == 0) or (lastCount > 5):
            # lastCount catches useless repeated links at the end
            return 1  # Site crawled completely
        random.shuffle(stack)  # Again, no bias or intended starvation
        crawled.append(urlCur)  # Hence crawled
        print("Stack Size: \t\t", stackLen)  # Helps keep track of what the program is doing, that's it
        urlCur = stack.pop()
        stackLen -= 1
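For completeness, this is roughly how I set up the globals and call the function (the exact driver script isn't shown here, so take this as a sketch; the start URL is just one of the sites as an example):
# Sketch of the setup, assuming the globals the function expects:
internal_urls = set()    # filled by get_all_website_links()
stack = []               # links waiting to be crawled
crawled = []             # links already crawled
total_urls_visited = 0
stackLen = 0

status = crawl("https://www.kaliokalam.com/", max_urls=5000)
print("[+] Total Internal links:", len(internal_urls))
print("[+] Total crawled URLs:", len(crawled))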
This is the code, used above, that collects all the internal links:
def get_all_website_links(url, req):
    global internal_urls  # Set of all internal URLs collected so far
    internal_urls.add(url)
    try:
        urls = set()  # All internal links found under the given link
        article = req.content  # Gets the content
        soup = BeautifulSoup(article, "lxml")
        domain_name = urlparse(url).netloc  # Get the domain
        for a_tag in soup.findAll("a"):  # Looks for all <a> tags
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                # empty href attribute
                continue
            # Skip useless links; I keep changing these, specific to each site ---
            if "/tag/" in href:
                continue
            if "/tags/" in href:
                continue
            if "/wp-content/" in href:
                continue
            if "/wp-contents/" in href:
                continue
            if "/wp-admin/" in href:
                continue
            if "/amp/" in href:
                continue
            if href.endswith((".png", ".jpg", ".css", ".gif", ".pdf", ".ico", ".feed", ".json", ".js", ".svg", ".php")):
                continue
            if href.endswith((".png/", ".jpg/", ".css/", ".gif/", ".pdf/", ".ico/", ".feed/", ".json/", ".js/", ".svg/", ".php/")):
                continue
            # Join the URL if it's relative (not an absolute link)
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            # Remove URL GET parameters, URL fragments, etc.
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            # If the joined link doesn't start with http, skip it
            if href.startswith("http") == False:
                continue
            if not is_valid(href):
                # not a valid URL
                continue
            if href in internal_urls:
                # already in the set
                continue
            if domain_name not in href:
                # external link
                continue
            # Hence an internal link; since these are sets, there's no repetition
            urls.add(href)
            internal_urls.add(href)
        return urls
    except Exception as e:
        print(e)
        print("None returned")
        return set(stack)  # Return the stack itself if the link was problematic
The code used above to check whether a link is valid is this:
def is_valid(url):
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
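A couple of quick checks of what it accepts and rejects (the values here are just illustrative):
print(is_valid("https://example.com/post/5"))  # True  - has a scheme and a netloc
print(is_valid("mailto:someone@example.com"))  # False - no netloc
print(is_valid("javascript:void(0)"))          # False - no netloc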
So this code works very well for most domains, but for these domains it somehow doesn't seem to get all the links, or any links at all. I used max_urls = 5000. Crawling and then scraping these sites matters to me because it would give me an unbiased corpus to analyze.
So, the sites are (I printed the counts to keep track):
www_kaliokalam_com_.txt
[+] Total Internal links: 79
[+] Total URLs: 80
[+] Total crawled URLs: 79
The number of links found this way is low, and the site clearly has more; even a quick manual look is enough to see that.
www_rupalialo_com_.txt
[+] Total Internal links: 39
[+] Total URLs: 45
[+] Total crawled URLs: 39
Same problem.
www_maadhukari_com_.txt
[+] Total Internal links: 31
[+] Total URLs: 75
[+] Total crawled URLs: 31
Again. There are definitely more hyperlinks within the site.
www_tatkhanik_com_.txt
[+] Total Internal links: 25
[+] Total URLs: 49
[+] Total crawled URLs: 25
The last one has a bigger, and perhaps different, problem: it yields no links at all.
www_sananda_in_.txt
[+] Total Internal links: 1
[+] Total External links: 0
[+] Total URLs: 1
[+] Total crawled URLs: 1
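In case it helps with diagnosing this, a minimal check like the one below (hypothetical, not taken from my actual runs; the URL and scheme are assumed) is what I would use to see how many <a> tags the raw HTML of one of these pages actually contains, and how many of them point to the same domain:
# Hypothetical diagnostic, not part of the crawler above.
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

user_agent = "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"
test_url = "https://www.sananda.in/"  # example: the site that yields almost no links
resp = requests.get(test_url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(resp.content, "lxml")
hrefs = [a.get("href") for a in soup.find_all("a") if a.get("href")]
same_domain = [h for h in hrefs
               if urlparse(urljoin(test_url, h)).netloc == urlparse(test_url).netloc]
print(len(hrefs), "anchors found,", len(same_domain), "on the same domain")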
I would like to know whether some changes in my code would help me crawl every one of these sites. Why doesn't it work for them the way it does for all the other sites (many, many of them)? My goal is to get all the internal links of sites with article/story-like content.