Suppose we have crawled an HTML page whose source begins as follows:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>OK resource collection-the latest film and television resource collection</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="OK resource collection-the latest film and television resource collection" />
<meta name="description" content="OK resource collection-the latest film and television resource collection" />
<script>var SitePath='/',SiteAid='10',SiteTid='',SiteId='';</script>
<link href="/template/okokzy/css/home.css" rel="stylesheet" type="text/css" />
<script src="/template/okokzy/js/jquery_ldg.js"></script>
<script src="/template/okokzy/js/jquery.zclip.min.js"></script>
<script src="/template/okokzy/js/ldg.js"></script>
<script src="/template/okokzy/js/ldg.js"></script>
<script src="/js/jq/jquery.lazyload.js"></script>
<script src="/template/okokzy/js/home.js"></script>
<script src="/template/okokzy/js/home.js"></script>
</head>
We want the content inside the script tags, and XPath is a convenient way to get it. To get the text of the first script tag, we can use:
r = tree.xpath('/html/head/script/text()')[0]
This returns the text of the first script tag under head under html. The call on its own:
tree.xpath('/html/head/script/text()')
returns a list holding the text of every script tag under head (tags with no text contribute nothing), and the [0] picks the first item of that list. Note that [0] is ordinary Python list indexing applied to the result. Inside an XPath expression positions start at 1, so the pure-XPath equivalent is (/html/head/script/text())[1].
Therefore, the output is:
var SitePath='/',SiteAid='10',SiteTid='',SiteId='';
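The difference between Python's 0-based list index and XPath's 1-based position is easy to trip over, so here is a minimal sketch of both forms. It assumes lxml is installed and uses a trimmed-down copy of the page head above. The second inline script (var demo) is a made-up addition for illustration and is not on the real page.

```python
from lxml import etree

# Trimmed-down stand-in for the crawled page head.
# The "var demo" script is hypothetical, added so the list has two items.
HTML = """
<html><head>
<title>OK resource collection</title>
<script>var SitePath='/',SiteAid='10',SiteTid='',SiteId='';</script>
<script src="/template/okokzy/js/home.js"></script>
<script>var demo = 1;</script>
</head><body></body></html>
"""

tree = etree.HTML(HTML)

# xpath() returns a plain Python list of text nodes.
# The script tag that only has a src attribute has no text, so it adds nothing.
all_texts = tree.xpath('/html/head/script/text()')

# Python list indexing on the result is 0-based.
first_by_python = all_texts[0]

# A position predicate inside the XPath expression is 1-based.
first_by_xpath = tree.xpath('(/html/head/script/text())[1]')[0]

print(all_texts)
print(first_by_python == first_by_xpath)
```

Both forms pick the same text node here. The Python index is usually simpler, while the XPath predicate is handy when the position has to be part of a longer expression.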
In an XPath expression, '//' means the path in front can be left out: it matches the tag that follows at any depth, skipping however many levels lie in between.
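To see what '//' changes, the sketch below runs an absolute path and two '//' paths against the same small document. It assumes lxml is installed. The HTML and the variable names in it are invented for illustration.

```python
from lxml import etree

# Hypothetical page: one script in head, one nested deep inside body.
HTML = """
<html><head>
<script>var SitePath='/';</script>
</head><body>
<div><div><script>var nested = true;</script></div></div>
</body></html>
"""

tree = etree.HTML(HTML)

# Absolute path: only script tags that are direct children of /html/head.
absolute = tree.xpath('/html/head/script/text()')

# Leading '//': script tags anywhere in the document, at any depth.
anywhere = tree.xpath('//script/text()')

# '//' in the middle: script tags at any depth below /html/body.
under_body = tree.xpath('/html/body//script/text()')

print(absolute)
print(anywhere)
print(under_body)
```

The absolute path finds only the script in head, the leading '//' finds both, and the last expression finds only the nested one.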
2. Use of tag attributes
Suppose we want to crawl the text inside font tags that have a particular color attribute, as shown below:
<font color="#000000">OK resource station </font><font color="#FF0000">HTTPS</font><font color="#000000"> Station Please enter </font></a></font><font size=
There are clearly several font tags with different colors, but we only want the text of those whose color is "#000000", so we use this expression:
r_two=tree.xpath('//font[@color="#000000"]/text()')
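As a quick check of the attribute predicate, here is a minimal sketch that runs the same expression on a cleaned-up version of the snippet above. It assumes lxml is installed. The surrounding html and body tags are added so the fragment parses as a page. The variable names black, colors and cleaned are illustrative.

```python
from lxml import etree

# Cleaned-up stand-in for the font snippet above.
HTML = """
<html><body>
<font color="#000000">OK resource station </font>
<font color="#FF0000">HTTPS</font>
<font color="#000000"> Station Please enter </font>
</body></html>
"""

tree = etree.HTML(HTML)

# [@color="#000000"] keeps only font tags whose color attribute equals that value.
black = tree.xpath('//font[@color="#000000"]/text()')

# @color on its own returns the attribute values instead of the text.
colors = tree.xpath('//font/@color')

# Text nodes keep their surrounding whitespace, so strip it if needed.
cleaned = [t.strip() for t in black]

print(black)
print(colors)
print(cleaned)
```

The red HTTPS entry is filtered out, and the attribute query shows which colors are present if you are unsure what value to match on.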
That is all it takes. The complete code is as follows:
import requests
from lxml import etree

post_url = 'https://www.okzy10.com/'
#cookie='lastCity=100010000; __zp_stoken__=ce26bZyQcLhoDK1A7M0RzPzMQEDJzHHpAQCJkUHtpSSFDSCkNeko0HBZxSywqeBxlHh8PIE4CLwgTSWsacwcdbEMNUBBzE2APASkfAktgOFskSn9HCTgkLmE7GFxecS8MGE4FGX99IHdsQHV5YQ%3D%3D; __c=1610949395; __g=-; __l=l=%2Fwww.zhipin.com%2F&r=https%3A%2F%2Fwww.google.com%2F&g=&s=3&friend_source=0&s=3&friend_source=0; __a=13532184.1600828409.1610683874.1610949395.205.23.3.205; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1610082805,1610683875,1610949395,1610949407; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1610949407'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
}

# To parse a local HTML file, use etree.parse
# To parse an HTML string, use etree.HTML
req = requests.get(post_url, headers=headers)
# print(req.text)

# Save the page to a local file
with open('ok_resource.html', 'w', encoding='utf-8') as fp:
    fp.write(req.text)

# Parse the saved file into an etree object and query it with XPath expressions
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse('ok_resource.html', parser=parser)
r = tree.xpath('/html/head/script/text()')[0]
r_two = tree.xpath('//font[@color="#000000"]/text()')
print(r)
print(r_two)
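The comments in the code mention two ways to parse: etree.parse for a local file and etree.HTML for a string. The sketch below runs both on the same content to show they give the same result. It assumes lxml is installed. It uses a short hard-coded page and a temporary file in place of the real response and ok_resource.html, so it runs without network access.

```python
import os
import tempfile

from lxml import etree

# Stand-in for req.text (hypothetical, shortened page).
PAGE = "<html><head><script>var SitePath='/';</script></head><body></body></html>"

# Route 1: parse the HTML string directly, no file needed.
tree_from_string = etree.HTML(PAGE)

# Route 2: save to a file first, then parse the file with an HTML parser.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'page.html')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(PAGE)
    parser = etree.HTMLParser(encoding='utf-8')
    tree_from_file = etree.parse(path, parser=parser)

# The same XPath expression works on both results.
expr = '/html/head/script/text()'
from_string = tree_from_string.xpath(expr)
from_file = tree_from_file.xpath(expr)

print(from_string)
print(from_file)
```

If the saved file is not needed for anything else, etree.HTML(req.text) skips the write-and-read step. Saving the file is still useful when you want to inspect the page or re-run the XPath queries without fetching it again.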