


When I try to retrieve the page source from a website, I get completely different (and much shorter) text than when I view the same page source in a web browser.


This fellow has a related issue, but he obtained the home page source instead of the requested one, whereas I am getting something completely alien.


from urllib import request

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    response = request.urlopen(url)
    return str(response.read())

n = 1006233
text = get_page_source(n)


This is the page I am targeting in this example: https://www.whoscored.com/Matches/1006233/live

The page at this URL has rich information in its source, but when I run the code above I get only the following:


b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX,
NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta 
name="viewport" content="initial-scale=1.0"><meta http-equiv="X-
UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;
height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&
xinfo=0-12919260-0 0NNY RT(1462118673272 111) q(0 -1 -1 -1) r(0 -1) 
B12(4,315,0) U2&incident_id=276000100045095595-100029307305590944&edet=12&
cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" 
marginwidth="0px">Request unsuccessful. Incapsula incident ID: 


What went wrong here? Can a server detect a robot even when it has not sent repeated requests (and if so, how), and is there a way around it?


There are a couple of issues here. The root cause is that the website you are trying to scrape can tell you are not a real person and is blocking you. Lots of websites do this simply by checking the request headers to see whether a request comes from a browser or from a robot. However, this site looks like it uses Incapsula, which is designed to provide more sophisticated protection. You can try to set up your request differently, setting headers to fool the security on the page, but I doubt this will work.

import requests

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    return response.text

n = 1006233
text = get_page_source(n)
print(text)
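
To see why the original request stands out, and to check whether a changed request still gets blocked, something like the following rough sketch may help. The helper names `default_user_agent` and `looks_blocked` are my own, and the marker strings are simply copied from the blocked output in the question; they are a heuristic, not anything Incapsula documents.

```python
from urllib import request

# Marker strings taken from the blocked response shown in the question.
# This list is an assumption and may not cover every block page.
BLOCK_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def default_user_agent():
    """Return the User-Agent header urllib sends when none is set explicitly."""
    opener = request.build_opener()
    return dict(opener.addheaders).get("User-agent", "")


def looks_blocked(html):
    """Heuristic: True if the body contains one of the known block-page markers."""
    return any(marker in html for marker in BLOCK_MARKERS)


# urllib identifies itself as a script, which is trivial for a server to spot.
print(default_user_agent())  # something like Python-urllib/3.x

blocked_sample = (
    '<iframe src="/_Incapsula_Resource?CWUDNSAI=24">'
    "Request unsuccessful. Incapsula incident ID: </iframe>"
)
print(looks_blocked(blocked_sample))                              # True
print(looks_blocked("<html><body>Match centre</body></html>"))   # False
```

Running `looks_blocked(text)` on whatever `get_page_source` returns should tell you whether the spoofed header made any difference.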

It looks like the site also uses captchas, which are designed to prevent web scraping. If a site is trying this hard to prevent scraping, it is likely because the data it provides is proprietary. I would suggest finding another site that provides this data, or trying to use an official API.

Check out this answer (https://stackoverflow.com/a/17769971/701449) from a while back. It looks like whoscored.com uses the OPTA API to provide its info. You may be able to skip the middleman and go straight to the source of the data. Good luck!



