Okay, this is my third question on the same topic. I am working on a project in which I will analyze the usage of certain word types and word contexts in Bengali. Since the corpus I need is not readily available, I am crawling and scraping multiple sites by topic.
I have written crawling code, and so far it has worked almost perfectly on more than 20-25 different Bengali websites/blogs/news portals. But to get data covering literary and story/novel-style language, I need to scrape a few more such sites. I cannot figure out exactly why the code fails to collect all the "internal" links from the five sites I list below. By internal I mean pages that belong to the site's own domain, i.e. within the site, as opposed to external links.
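To make that concrete, here is a minimal sketch of the check I mean (the helper name is_internal is only for illustration and is not part of my crawler): a link counts as internal when, after resolving it against the page URL, it stays on the same domain.
from urllib.parse import urlparse, urljoin

def is_internal(page_url, href):
    # Illustrative only: resolve a possibly relative href against the page URL
    # and compare the two domains.
    full = urljoin(page_url, href)
    return urlparse(full).netloc == urlparse(page_url).netloc

# is_internal("https://example.com/story/1", "/story/2")          -> True
# is_internal("https://example.com/story/1", "https://other.com") -> False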
Libraries used:
import requests
import urllib.parse
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import time
import random
My crawling code (I put a lot of work into handling errors and not losing data; comments are included):
def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links. Iterative function.
    You'll find all links in the `internal_urls` global set variable.
    params:
        max_urls (int): maximum number of URLs to crawl, default is 30.
    """
    global total_urls_visited  # Global crawled-URL counter, initialized to zero
    global stack               # Since it's an iterative function, I maintain a stack
    global crawled             # Keeps track of all the links already crawled
    global stackLen            # Length of the stack, initialized to zero
    session = requests.Session()  # I used a Session because multiple links are being
                                  # requested; this somewhat increased the speed
    # Certain websites would not allow direct access without a header.
    # This one seemed to work fine with all of them.
    user_agent = "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"
    urlCur = url   # Stores the current URL
    lastCount = 0  # Error management: somehow some useless links kept returning
                   # links like ".../%252525..." in a loop,
                   # and I don't know how to avoid that.
                   # So I did this: if the last link is crawled too many times, just return
    while True:
        sesCheck = 0  # Counts the number of times the Session will be recreated
        print("\nCrawled Count: \t\t", total_urls_visited)
        # Just prints the number of sites already crawled;
        # it helps to keep track of what's going on, since this takes time
        total_urls_visited += 1
        if stackLen == 1:  # Here, that issue of repeated useless links
            lastCount += 1
        req = ''
        # Multiple network-failure checks; the connection can be bad here :(
        # I have lost too much data to this
        checkError1 = 0
        checkError2 = 0
        checkError3 = 0
        checkError4 = 0
        while req == '':
            try:
                if (stackLen > 0) and (checkError4 > 15):
                    print("\n\nRepeated Error, endlessly. Abort.\n\n")
                    checkError4 = 0
                    return 0  # Value that signals the process couldn't be completed fully
                if (stackLen > 0) and (checkError3 > 10):
                    print("\n\nRepeated Error, sleep two minutes. Then skip.\n\n")
                    time.sleep(120)
                    # Certain sites won't let me request any more midway,
                    # something about a max limit
                    random.shuffle(stack)
                    # I always shuffle the stack before popping,
                    # so there's no bias towards which link shows up next.
                    # Also, no intentional starvation
                    urlCur = stack.pop()
                    session = requests.Session()
                    # I don't know if I should do this, but to avoid overloading the server
                    # (I don't know if that's how it works)
                    # I request a new Session
                    stackLen -= 1
                    checkError3 = 0
                if (stackLen > 0) and (checkError2 > 8):
                    print("\n\nRepeated Error, sleep for a minute. Then skip.\n\n")
                    time.sleep(60)  # Same reason
                    random.shuffle(stack)
                    urlCur = stack.pop()
                    session = requests.Session()  # Should I be doing this?
                    stackLen -= 1
                    checkError2 = 0
                if (stackLen > 0) and (checkError1 > 3):
                    print("\n\nRepeated Error. Sleep for a minute.\n\n")
                    time.sleep(60)  # It just gives me enough time to sometimes
                                    # switch my modem off and on,
                                    # to fix connectivity issues
                    session = requests.Session()
                    checkError1 = 0
                startTime = time.time()  # Checks how long the request takes
                req = session.get(urlCur, headers={'User-Agent': user_agent})
                endTime = time.time()
                # If it took more than a minute and this isn't happening often,
                # request a new session (is that required? I'm dealing with too many links)
                if (endTime - startTime > 60) and (sesCheck <= 2):
                    session = requests.Session()
                    sesCheck += 1
            except Exception as e:
                print(e)
                checkError1 += 1
                checkError2 += 1
                checkError3 += 1
                checkError4 += 1
                print("Connection refused :(")
                print("5 seconds break")
                time.sleep(5)
                print("Let's go")
                continue
        if stackLen < 5:
            print(f"[*] Crawling: {urlCur}")  # Prints the last few links, in case some error is about to show up
        links = get_all_website_links(urlCur, req)  # Gets all internal links
        for link in links:
            if link not in (stack + crawled):  # If the link has been crawled or is already in the stack, ignore it
                stack.append(link)
                stackLen += 1
        if (total_urls_visited > max_urls) or (stackLen == 0) or (lastCount > 5):
            # lastCount catches useless repeated links at the end
            return 1  # Site crawled completely
        random.shuffle(stack)  # Again, no bias or intended starvation
        crawled.append(urlCur)  # Hence crawled
        print("Stack Size: \t\t", stackLen)  # Helps keep track of what the program is doing, that's it
        urlCur = stack.pop()
        stackLen -= 1
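For completeness, this is roughly how I set up the globals and call the function (the exact driver script isn't shown here, so take this as a sketch; the start URL is just one of the sites as an example):
# Sketch of the setup, assuming the globals the function expects:
internal_urls = set()    # filled by get_all_website_links()
stack = []               # links waiting to be crawled
crawled = []             # links already crawled
total_urls_visited = 0
stackLen = 0

status = crawl("https://www.kaliokalam.com/", max_urls=5000)
print("[+] Total Internal links:", len(internal_urls))
print("[+] Total crawled URLs:", len(crawled))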
This is the code, used above, that collects all the internal links:
def get_all_website_links(url, req):
    global internal_urls  # Set of all internal URLs collected so far
    internal_urls.add(url)
    try:
        urls = set()  # All internal links found under the given link
        article = req.content  # Gets the content
        soup = BeautifulSoup(article, "lxml")
        domain_name = urlparse(url).netloc  # Get the domain
        for a_tag in soup.findAll("a"):  # Looks for all <a> tags
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                # empty href attribute
                continue
            # Skip useless links; I keep changing these, specific to each site ---
            if "/tag/" in href:
                continue
            if "/tags/" in href:
                continue
            if "/wp-content/" in href:
                continue
            if "/wp-contents/" in href:
                continue
            if "/wp-admin/" in href:
                continue
            if "/amp/" in href:
                continue
            if href.endswith((".png", ".jpg", ".css", ".gif", ".pdf", ".ico", ".feed", ".json", ".js", ".svg", ".php")):
                continue
            if href.endswith((".png/", ".jpg/", ".css/", ".gif/", ".pdf/", ".ico/", ".feed/", ".json/", ".js/", ".svg/", ".php/")):
                continue
            # Join the URL if it's relative (not an absolute link)
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            # Remove URL GET parameters, URL fragments, etc.
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            # If the joined link doesn't start with http, skip it
            if href.startswith("http") == False:
                continue
            if not is_valid(href):
                # not a valid URL
                continue
            if href in internal_urls:
                # already in the set
                continue
            if domain_name not in href:
                # external link
                continue
            # Hence an internal link; since these are sets, there's no repetition
            urls.add(href)
            internal_urls.add(href)
        return urls
    except Exception as e:
        print(e)
        print("None returned")
        return set(stack)  # Return the stack itself if the link was problematic
The code used above to check whether a link is valid is this:
def is_valid(url):
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
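A couple of quick checks of what it accepts and rejects (the values here are just illustrative):
print(is_valid("https://example.com/post/5"))  # True  - has a scheme and a netloc
print(is_valid("mailto:someone@example.com"))  # False - no netloc
print(is_valid("javascript:void(0)"))          # False - no netloc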
So this code works very well for most domains, but for these domains it somehow doesn't seem to get all the links, or any links at all. I used max_urls = 5000. Crawling and then scraping these sites matters to me because it would give me an unbiased corpus to analyze.
So, the sites are (I printed the counts to keep track):
www_kaliokalam_com_.txt
[+] Total Internal links: 79
[+] Total URLs: 80
[+] Total crawled URLs: 79
The number of links found this way is low, and the site clearly has more; even a quick manual look is enough to see that.
www_rupalialo_com_.txt
[+] Total Internal links: 39
[+] Total URLs: 45
[+] Total crawled URLs: 39
Same problem.
www_maadhukari_com_.txt
[+] Total Internal links: 31
[+] Total URLs: 75
[+] Total crawled URLs: 31
Again. There are definitely more hyperlinks within the site.
www_tatkhanik_com_.txt
[+] Total Internal links: 25
[+] Total URLs: 49
[+] Total crawled URLs: 25
The last one has a bigger, and perhaps different, problem: it yields no links at all.
www_sananda_in_.txt
[+] Total Internal links: 1
[+] Total External links: 0
[+] Total URLs: 1
[+] Total crawled URLs: 1
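In case it helps with diagnosing this, a minimal check like the one below (hypothetical, not taken from my actual runs; the URL and scheme are assumed) is what I would use to see how many <a> tags the raw HTML of one of these pages actually contains, and how many of them point to the same domain:
# Hypothetical diagnostic, not part of the crawler above.
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

user_agent = "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"
test_url = "https://www.sananda.in/"  # example: the site that yields almost no links
resp = requests.get(test_url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(resp.content, "lxml")
hrefs = [a.get("href") for a in soup.find_all("a") if a.get("href")]
same_domain = [h for h in hrefs
               if urlparse(urljoin(test_url, h)).netloc == urlparse(test_url).netloc]
print(len(hrefs), "anchors found,", len(same_domain), "on the same domain")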
I would like to know whether some changes in my code would help me crawl every one of these sites. Why doesn't it work for them the way it does for all the other sites (many, many of them)? My goal is to get all the internal links of sites with article/story-like content.