1. Three Online Encyclopedias
With the rapid development of the Internet and big data, much work has emerged around integrating multi-source knowledge bases, building knowledge graphs, and establishing computing engines. Representative knowledge graph applications include Google's Knowledge Graph, Facebook's Graph Search, Baidu's Zhixin, and Sogou's Zhilifang (Knowledge Cube). These applications differ in detail, but they share one thing: they all draw on online encyclopedias such as Wikipedia, Baidu Encyclopedia, and Interactive Encyclopedia during construction. This chapter therefore teaches you how to crawl these three online encyclopedias.
An encyclopedia is a general compendium of knowledge across disciplines such as astronomy, geography, nature, the humanities, religion, belief, and literature. It can be comprehensive, covering content from all fields, or specialized in a single domain. Next, we introduce three common online encyclopedias, which are among the important corpora for information extraction research.
1. Wikipedia
"Wikipedia is a free online encyclopedia with the aim to allow anyone to edit articles." This is the official introduction of Wikipedia. Wikipedia is a multi-language encyclopedia collaboration project based on Wiki technology. The word "Wikipedia" combines its core technology "Wiki" with "encyclopedia", and its articles are created jointly by its contributors.
Among all online encyclopedias, Wikipedia has the highest-quality knowledge and the best structure, but it originally focused on English knowledge and covers relatively little Chinese knowledge. An online encyclopedia page usually includes: Title, Abstract (description), Infobox (message box), Categories (entity categories), Crosslingual Links, and so on. The Chinese Wikipedia page of the entity "Huangguoshu Waterfall" is shown in Figure 1.
Figure 1 shows the following Wikipedia page elements:
- Article Title: the unique identifier of an article (except for disambiguation pages), corresponding to one entity, here "Huangguoshu Waterfall".
- Abstract: describes the entire article or entity in one or two concise sentences, which has important use value.
- Free Text: includes the full text and partial text. The full text is all the textual information describing the article, including the abstract and each section; partial text describes one part of the article, which users can pick out as needed.
- Category Labels: identify the types of the article; as shown in the figure, "Huangguoshu Waterfall" includes "National 5A Tourist Scenic Area", "Waterfalls of China", "Guizhou Tourism", etc.
- Infobox (message box): also known as an information module or information box. It displays web information in a structured form, describing the attributes and attribute values of the article or entity. An infobox contains a number of "attribute-attribute value" pairs that gather the core information of the article and characterize the entire webpage or entity, as sketched below.
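In code, such "attribute-attribute value" pairs map naturally onto a dictionary. The following minimal sketch (the values are abbreviated and hypothetical, based on the figure) shows how the "Huangguoshu Waterfall" infobox could be represented in Python:

```python
# A hypothetical in-memory representation of an infobox:
# each key is an attribute name, each value is the attribute value.
infobox = {
    "Chinese name": "黄果树瀑布",
    "Location": "Anshun, Guizhou, China",
    "Scenic-spot level": "National 5A Tourist Scenic Area",
}

for attribute, value in infobox.items():
    print(attribute, "->", value)
```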

2. Baidu Encyclopedia
Baidu Encyclopedia is an open and free online encyclopedia platform launched by Baidu. As of April 2017, Baidu Encyclopedia had included more than 14.32 million entries, with more than 6.1 million netizens participating in entry editing, covering almost all known fields of knowledge. Baidu Encyclopedia aims to create a Chinese information platform covering knowledge in all fields. It emphasizes user participation and contribution, mobilizing the power of Internet users, pooling their wisdom, and encouraging active exchange and sharing. At the same time, Baidu Encyclopedia is integrated with Baidu Search and Baidu Knows to meet users' information needs at different levels.
Compared with Wikipedia, Baidu Encyclopedia contains the most Chinese knowledge and has the widest coverage, but its accuracy is relatively poor. Baidu Encyclopedia pages likewise include: Title, Abstract (description), Infobox (message box), Categories (entity categories), Crosslingual Links, and so on. Figure 2 shows the Baidu Encyclopedia "Python" webpage. The infobox is the middle part of the page and uses key-value pairs; for example, the value of "foreign name" is "Python", and the value of "classic textbook" is "Head First Python".

3. Interactive Encyclopedia
Interactive Encyclopedia (www.baike.com) is a pioneer and leader among Chinese encyclopedia websites. It is committed to providing hundreds of millions of Chinese users with massive, comprehensive, and timely encyclopedia information, and to continuously improving how users create, acquire, and share information through its new Wiki platform. As of the end of 2016, Interactive Encyclopedia had grown into an encyclopedia website with 16 million entries, 20 million pictures, and 50,000 micro-encyclopedias created by more than 11 million users, with over 20 million users in total.
Compared with Baidu Encyclopedia, Interactive Encyclopedia has higher accuracy and better structure, and its knowledge quality in professional fields is high. Therefore, researchers usually choose Interactive Encyclopedia as one of their main corpora. Figure 3 shows the homepage of Interactive Encyclopedia.
Interactive Encyclopedia stores information in two forms: one is the structured infobox of an entry, and the other is the free text of the entry. Among entry articles, only some contain a structured infobox, but all entries contain free text. The infobox displays entry information in a structured way. A typical infobox is shown in Figure 4, which displays the information of the entry "Python", for example the designer "Guido van Rossum".
The following sections explain how to use Selenium to crawl the three online encyclopedias. The analysis methods differ slightly. For Wikipedia, we first obtain the links of the Group of Twenty (G20) countries from a list page and then analyze and crawl each country page in turn. For Baidu Encyclopedia, we call Selenium to operate the page automatically, enter each entry (here, scenic spots) into the search box, and then locate and crawl the content. For Interactive Encyclopedia, we construct the URL of each entry page directly and then visit the different entries for analysis and capture.
2. Selenium Crawls Baidu Encyclopedia
As the largest Chinese online encyclopedia and knowledge platform, Baidu Encyclopedia provides knowledge of various industries for researchers. Although the accuracy of its entries is not the best, it still offers a good knowledge platform for scholars engaged in data mining, knowledge graphs, natural language processing, big data, and other fields.
1. Webpage Analysis
This section explains in detail an example of Selenium crawling Baidu Encyclopedia. The crawl target is the infobox (message box) information of ten national 5A-level scenic spots. The core steps of the webpage analysis are as follows:
(1) Call Selenium to search Baidu Encyclopedia automatically
First, call Selenium to visit the Baidu Encyclopedia homepage at "https://baike.baidu.com". At the top of the homepage is the search box: enter an entry such as "故宫" (the Forbidden City) and click "进入词条" (Enter the entry) to get its detailed information.
Then, select the "进入词条" button with the mouse in the browser and right-click "Inspect Element" to view the HTML source code corresponding to the button, as shown in Figure 6. Note that the menu item name differs between browsers: the figure uses the 360 Secure Browser, where it is called "审查元素" (Review Element), while Chrome calls it "检查" (Inspect), QQ Browser also calls it "检查", and so on.
The HTML core code corresponding to "进入词条" is shown below:

```html
<div class="...">
  <form id="searchForm" action="/search/word" method="get">
    <input id="query" nslog="normal" name="word" type="text"
           autocomplete="off" autocorrect="off" value="">
    <button id="search" nslog="normal" type="button">进入词条</button>
    <button id="searchLemma" nslog="normal" type="button">全站搜索</button>
    <a class="help" href="/help" nslog="normal">帮助</a>
  </form>
  ...
</div>
```
Calling the following Selenium function obtains the search input control:

```python
driver.find_element_by_xpath("//form[@id='searchForm']/input")
```
Then automatically enter "故宫", obtain the "进入词条" button, and automatically click it to reach the "故宫" page. The core code is shown below:

```python
driver.get("http://baike.baidu.com/")
elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
elem_inp.send_keys(name)
elem_inp.send_keys(Keys.RETURN)
```
(2) Visit the "故宫" page and locate the infobox
After the first step is completed, we enter the "故宫" page, find the infobox section in the middle of the page, right-click it, and click "Inspect Element"; the result is shown in Figure 7.
The infobox mainly stores information in the form of attribute-attribute value pairs. For example, the value of the attribute "中文名" (Chinese name) is "北京故宫" (Beijing Forbidden City), and the value of the attribute "外文名" (foreign name) is "Forbidden City". The corresponding part of the HTML source code is as follows:

```html
<div class="basic-info J-basic-info cmn-clearfix">
  <dl class="basicInfo-block basicInfo-left">
    <dt class="basicInfo-item name">中文名</dt>
    <dd class="basicInfo-item value">北京故宫</dd>
    <dt class="basicInfo-item name">外文名</dt>
    <dd class="basicInfo-item value">Forbidden City</dd>
    <dt class="basicInfo-item name">类别</dt>
    <dd class="basicInfo-item value">世界文化遗产，历史博物馆</dd>
    ...
  </dl>
  <dl class="basicInfo-block basicInfo-right">
    <dt class="basicInfo-item name">占地面积</dt>
    <dd class="basicInfo-item value">约72万平方米</dd>
    ...
  </dl>
</div>
```
The entire infobox is located in the `<div class="basic-info J-basic-info cmn-clearfix">` tag, under which are `<dl>`, `<dt>`, and `<dd>` tags. The infobox div layout includes two `<dl>...</dl>` blocks: one records the left part of the infobox, and the other records the right part. The attributes and attribute values are defined within each `<dl>` tag, as shown in Figure 8.
Note: `<dt>` and `<dd>` must be wrapped in an outer `<dl>`. The `<dl>` tag defines a definition list, the `<dt>` tag defines an item in the list, and the `<dd>` tag describes that item, similar to how `<table>` combines with its child tags.
Then call the find_elements_by_xpath() function of the Selenium package to locate the attribute names and attribute values respectively. The function returns multiple elements, which are then output through a for loop. The code is as follows:

```python
elem_name = driver.find_elements_by_xpath(
    "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dt")
elem_value = driver.find_elements_by_xpath(
    "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dd")
for e in elem_name:
    print(e.text)
for e in elem_value:
    print(e.text)
```
2. Code Implementation
The analysis method for crawling the infoboxes of Baidu Encyclopedia 5A-level scenic spots with Selenium has now been explained. Below are the complete code and some of its difficulties.

Next, we define multiple Python files that call each other to implement the crawler. The complete code includes two files:
- test10_01_baidu.py: defines the main function main() and calls the getinfo.py file
- getinfo.py: crawls the infobox through the getInfobox() function
test10_01_baidu.py
```python
# -*- coding: utf-8 -*-
"""
test10_01_baidu.py: defines the main function main() and calls getinfo.py
By: Eastmount CSDN 2021-06-23
"""
import codecs
import getinfo  # import the crawler module

# main function
def main():
    # read the scenic-spot names from file
    source = open('data.txt', 'r', encoding='utf-8')
    for name in source:
        print(name)
        getinfo.getInfobox(name)
    print('End Read!')
    source.close()

if __name__ == '__main__':
    main()
```
The statement "import getinfo" imports the getinfo.py file as a module. We then call the getInfobox() function in getinfo.py to perform the infobox crawl.
getinfo.py
```python
# coding=utf-8
"""
getinfo.py: crawls the infobox of national 5A-level scenic spots
By: Eastmount CSDN 2021-06-23
"""
import os
import codecs
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# getInfobox function: get the infobox of a national 5A-level scenic spot
def getInfobox(name):
    try:
        # visit Baidu Encyclopedia and search automatically
        driver = webdriver.Firefox()
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        time.sleep(2)  # wait for the entry page to load
        print(driver.current_url)
        print(driver.title)
        # crawl the infobox content
        elem_name = driver.find_elements_by_xpath(
            "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dt")
        elem_value = driver.find_elements_by_xpath(
            "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dd")
        # pair attribute names with attribute values via a dictionary
        elem_dic = dict(zip(elem_name, elem_value))
        for key in elem_dic:
            print(key.text, elem_dic[key].text)
        time.sleep(5)
        return
    except Exception as e:
        print("Error:", e)
    finally:
        print('\n')
        driver.close()
```
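Note that the code above relies on fixed time.sleep() calls. As a minimal sketch of a more robust alternative (not part of the original code), an explicit wait pauses only until the infobox actually appears:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the infobox container to be present,
# then continue as soon as it appears.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//div[contains(@class, 'basic-info')]")))
```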
During the crawling process, Firefox automatically searches for the "故宫" page, as shown in the figure below:
The crawled content is as follows:

```
https://baike.baidu.com/item/%E5%8C%97%E4%BA%AC%E6%95%85%E5%AE%AB
北京故宫_百度百科
Chinese name          北京故宫
Geographical location No. 4 Jingshan Front Street, Beijing [91]
Opening hours         Apr 1 - Oct 31: 08:20-17:00 (ticketing stops 16:00, last entry 16:10);
                      Nov 1 - Mar 31: 08:30-16:30 (ticketing stops 15:30, last entry 15:40) [91]
Scenic-spot level     AAAAA
Ticket price          60 yuan (peak season) / 40 yuan (off season) [7]
Area                  720,000 square meters (building area about 150,000 square meters)
Protection level      World Cultural Heritage; first batch of national key cultural relics protection units
Approving bodies      UNESCO; the State Council of the People's Republic of China
Number                III-100
Main collections      清明上河图, 乾隆款金瓯永固杯, ... [8]
Official phone        010-85007057 [92]
```
The Python running result is shown below, where the data.txt file includes the following common scenic spots:
- Beijing Palace Museum
- Huangguoshu Waterfall
- Summer Palace
- Badaling Great Wall
- Ming Tombs
- Prince Gong's Mansion (Gongwangfu)
- Beijing Olympic Park
- Huangshan
The code above outputs the attributes and attribute values through a dictionary. The core code is as follows:

```python
elem_dic = dict(zip(elem_name, elem_value))
for key in elem_dic:
    print(key.text, elem_dic[key].text)
```
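If you also want to keep these pairs for later analysis, a minimal sketch (the file name baidu_infobox.txt is a hypothetical choice) is to append them to a local file:

```python
import codecs

# Append each attribute-value pair to a local UTF-8 text file.
with codecs.open("baidu_infobox.txt", "a", "utf-8") as f:
    for key in elem_dic:
        f.write("%s\t%s\r\n" % (key.text, elem_dic[key].text))
```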
At the same time, readers can try calling the local headless browser PhantomJS for crawling. The call method is as follows:

```python
driver = webdriver.PhantomJS(executable_path="C:\\...\\phantomjs.exe")
```
Homework:
The author has shown how to crawl infoboxes; readers are encouraged to practice on other entries. Such corpora will be essential for your subsequent work in text mining and NLP, such as text classification, entity alignment, entity disambiguation, and knowledge graph construction.
3. Selenium Crawls Wikipedia
Online encyclopedias hold some of the largest amounts of data on the Internet, and these data have a certain structure, i.e., they are semi-structured data. This section and the next cover Wikipedia and Interactive Encyclopedia. First, the author introduces an example of crawling Wikipedia.
1. Webpage Analysis
In this first example, the author explains in detail how Selenium crawls the first paragraph of the abstract for each of the Group of Twenty (G20) countries. The specific steps are as follows:
(1) Get the hyperlinks from the G20 list page
The G20 list page is shown below. Wikipedia sorts the countries alphabetically by their English names, such as "Japan", "Italy", "Brazil", etc. Each country jumps to its page through a hyperlink.

https://en.wikipedia.org/wiki/Category:G20_nations
We need to obtain the hyperlinks of the 20 countries and then crawl each country's page. Select a country's hyperlink, such as "China", right-click it, and click "Inspect" to get the corresponding HTML source code, as shown below.
The hyperlinks are located under the `<ul><li><a>` nodes of the "mw-category-group" layout. The corresponding code:

```html
<div class="mw-pages">
  <div lang="en" dir="ltr" class="mw-content-ltr">
    <div class="mw-category">
      <div class="mw-category-group">
        <h3>C</h3>
        <ul>
          <li><a href="/wiki/China" title="China">China</a></li>
        </ul>
      </div>
      <div class="mw-category-group">...</div>
      <div class="mw-category-group">...</div>
      ...
    </div>
  </div>
</div>
```
Call Selenium's find_elements_by_xpath() function to get the hyperlinks whose node class attribute is "mw-category-group"; it returns multiple elements. The core code for locating the hyperlinks is as follows:

```python
driver.get("https://en.wikipedia.org/wiki/Category:G20_nations")
elem = driver.find_elements_by_xpath("//div[@class='mw-category-group']/ul/li/a")
for e in elem:
    print(e.text)
    print(e.get_attribute("href"))
```
The find_elements_by_xpath() function first analyzes the DOM tree structure of the HTML, locates the specified nodes, and gets their elements. A for loop then obtains each node's text content and href attribute in turn. Here e.text represents the text content of a node; for example, the content between the tags of the node below is "China":

```html
<a href="/wiki/China" title="China">China</a>
```
At the same time, e.get_attribute("href") returns the value of the href attribute, namely "/wiki/China". Similarly, e.get_attribute("title") obtains the title attribute, whose value is "China".
At this point the names and URLs are stored in variables as shown below; we can then visit each country's page and obtain the required content.
(2) Call Selenium to locate and crawl each country's page
As shown in the figure, a country page contains the URL, title, abstract, content, infobox, and so on. The infobox is on the right side of the page and includes the country's full name, location, and other facts.
The infobox describes the entity through attribute-attribute value pairs; it summarizes a web entity very concisely and precisely, such as capital - Beijing, population - 1.3 billion. Generally, after obtaining this information, you need a pre-processing step before you can analyze the data; later chapters explain this in detail. A minimal cleanup sketch follows.
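For instance, crawled attribute values often carry citation markers such as "[91]" and irregular whitespace. The helper below is a hypothetical sketch (not part of the original code) that strips them with regular expressions:

```python
import re

def clean_value(text):
    """Remove citation markers like [91] and collapse whitespace."""
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_value("No. 4 Jingshan Front Street [91]"))
# -> "No. 4 Jingshan Front Street"
```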
After visiting each country's page, we need to obtain the first paragraph of its introduction. The content crawled in this section may be relatively simple, but the method explained is important, including how to locate nodes and how to crawl the knowledge. The corresponding HTML core code of the detail page is as follows:

```html
<div class="mw-parser-output">
  <div role="note" class="hatnote navigation-not-searchable">...</div>
  <table class="infobox geography vcard">...</table>
  <p><b>China</b>, officially the <b>People's Republic of China</b>, ...</p>
  <p>...</p>
  <p>...</p>
  ...
</div>
```
The browser's Inspect Element view is shown in the figure.
The content is located under the `<div>` node whose class attribute is "mw-parser-output". In HTML, the `<p>` tag represents a paragraph and is usually used to hold body text, and the `<b>` tag means bold. To get the first paragraph of content, locate the first `<p>` node containing text. The core code is as follows:
```python
driver.get("https://en.wikipedia.org/wiki/China")
elem = driver.find_element_by_xpath("//div[@class='mw-parser-output']/p[2]").text
print(elem)
```
Note that the first paragraph of body text is located in the second `<p>` node, hence p[2]. At the same time, if readers want to obtain the infobox from the source code, they need to locate it and capture its data. The infobox corresponds to the following node in HTML, which records the core information of the web entity:
```html
<table class="infobox geography vcard">...</table>
```
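As a minimal sketch (not part of the original code, and assuming `driver` is the Firefox session opened above), the rows of that table can be located through an XPath on the class attribute and read one by one:

```python
# Iterate over the infobox rows; many rows hold a label and a value.
rows = driver.find_elements_by_xpath("//table[contains(@class, 'infobox')]//tr")
for row in rows:
    if row.text:
        print(row.text)
```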
2. Code Implementation
The complete code is in the file test10_02.py, as shown below:

```python
# coding=utf-8
# By: Eastmount CSDN 2021-06-23
import time
import re
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org/wiki/Category:G20_nations")
elem = driver.find_elements_by_xpath("//div[@class='mw-category-group']/ul/li/a")

name = []   # country names
urls = []   # country hyperlinks

# crawl the hyperlinks
for e in elem:
    print(e.text)
    print(e.get_attribute("href"))
    name.append(e.text)
    urls.append(e.get_attribute("href"))
print(name)
print(urls)

# crawl the first paragraph of each country page
for url in urls:
    driver.get(url)
    elem = driver.find_element_by_xpath("//div[@class='mw-parser-output']/p[2]").text
    print(elem)
```

The running result is shown in the figure.
PS: You can simply try this part yourself; it is also recommended to crawl Baidu Encyclopedia, Interactive Encyclopedia, and Sogou Encyclopedia.
4. Selenium Crawls Interactive Encyclopedia
Time has passed, and Interactive Encyclopedia has since become the "Quick Encyclopedia" (快懂百科), but fortunately the webpage structure has not changed.

1. Webpage Analysis
Online encyclopedias have become important sources for semantic analysis, knowledge graph construction, natural language processing, search engines, and artificial intelligence. As one of the most popular online encyclopedias, Interactive Encyclopedia provides researchers with strong corpus support.
This section explains how to crawl the abstract information of the pages of the ten most popular programming languages from Interactive Encyclopedia. Through this example, readers will deepen their grasp of using Selenium crawlers to analyze and capture network data. Unlike Wikipedia, where we first crawl a list page and then the required information, and Baidu Encyclopedia, where we input an entry into the search box and then crawl the page, here we jump directly to each entry's detail page and crawl the information.
Because Interactive Encyclopedia's search for different entries follows a fixed rule, namely "common URL + entry name", we construct each entry's webpage through this method. The specific steps are as follows:
(1) Call Selenium to analyze the URL and search Interactive Encyclopedia
We first analyze the URL rules of Interactive Encyclopedia. For example, searching for the entry "贵州" (Guizhou) gives the URL:

http://www.baike.com/wiki/贵州
The corresponding page is shown in the figure. It shows the hyperlink URL at the top, the entry "贵州", the first paragraph of its abstract, and, on the right side, the corresponding pictures and other information.
From this we get a simple rule, namely:

http://www.baike.com/wiki/entry

which can be used to search for the corresponding knowledge. For example, the URL for the programming language "Java" is:

http://www.baike.com/wiki/Java
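Note that for Chinese entry names like "贵州", it is safer to percent-encode the name when building the URL programmatically. A minimal sketch:

```python
from urllib.parse import quote

base = "http://www.baike.com/wiki/"
url = base + quote("贵州")  # percent-encode non-ASCII entry names
print(url)
```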
(2) Visit the Top 10 popular programming languages and crawl their abstracts
In 2016, GitHub ranked languages by the number of pull requests submitted over the previous 12 months; the resulting Top 10 most popular programming languages were JavaScript, Java, Python, Ruby, PHP, C++, CSS, C#, C, and Go.

We then need to obtain the abstract information of these ten languages in turn. Select the abstract part in the browser, right-click it, and click "Inspect Element"; the result is shown in the figure. The HTML source code corresponding to the abstract appears at the bottom.

The content in the new version, the "Quick Encyclopedia", is shown in the figure below:

The HTML core code corresponding to the abstract of the "Java" entry is shown below:

```html
<div class="summary">
  <div class="content-p">
    <span>Java is an object-oriented </span>
    <a href="/wikiid/76015795959865866248?from=wiki_content" class=""
       clicklog="baike_search_clink">
      <span>programming language</span>
    </a>
    <span>. It not only absorbs the various advantages of the C++ language, but also
    abandons concepts that are hard to understand in C++, such as multiple inheritance
    and pointers; the Java language is therefore both powerful and simple to use. As a
    representative of static object-oriented languages, Java is an excellent realization
    of object-oriented theory, allowing programmers to carry out complex programming in
    an elegant way of thinking.</span>
  </div>
  <div class="content-p">
    <span>Java is simple, distributed, ...</span>...
  </div>
</div>
```
Calling Selenium's find_element_by_xpath() function obtains the abstract paragraph. The core code is as follows:

```python
driver = webdriver.Firefox()
url = "http://www.baike.com/wiki/" + name
driver.get(url)
elem = driver.find_element_by_xpath("//div[@class='summary']/div/span")
print(elem.text)
```
The basic steps are:
- First call the webdriver.Firefox() driver to open the Firefox browser.
- Analyze the webpage hyperlink and call the driver.get(url) function to visit it.
- Analyze the webpage DOM tree structure and call driver.find_element_by_xpath() to extract the content.
- Output the result; for some websites the content needs to be stored locally, and unnecessary content needs to be filtered out.
Below are the complete code and a detailed explanation.
2. Code Implementation
The complete code is in blog10_03.py, shown below. The main function main() calls the getAbstract() function to crawl the abstracts of the Top 10 programming languages.

```python
# coding=utf-8
# By: Eastmount CSDN 2021-06-23
import os
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()

# getAbstract function: get the abstract of an entry
def getAbstract(name):
    try:
        # create the output folder and file
        basePathDirectory = "Hudong_Coding"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "HudongSpider.txt")
        # create the file if it does not exist, otherwise append to it
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')

        url = "http://www.baike.com/wiki/" + name
        print(url)
        driver.get(url)
        elem = driver.find_elements_by_xpath("//div[@class='summary']/div/span")
        content = ""
        for e in elem:
            content += e.text
        print(content)
        info.writelines(content + '\r\n')
    except Exception as e:
        print("Error:", e)
    finally:
        print('\n')
        info.write('\r\n')

# main function
def main():
    languages = ["JavaScript", "Java", "Python", "Ruby", "PHP",
                 "C++", "CSS", "C#", "C", "Go"]
    print('Start crawling')
    for lg in languages:
        print(lg)
        getAbstract(lg)
    print('End crawling')

if __name__ == '__main__':
    main()
```
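One caveat worth noting: entry names such as "C#" contain characters with special meaning in URLs (the '#' starts a URL fragment and is silently dropped), so in practice they may need percent-encoding. A minimal sketch reusing the standard library:

```python
from urllib.parse import quote

name = "C#"
# Without quoting, '#' would be treated as a fragment and never reach the server.
url = "http://www.baike.com/wiki/" + quote(name)
print(url)  # http://www.baike.com/wiki/C%23
```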
The crawl results for the "JavaScript" and "Java" programming languages are shown below; this code crawls the abstract information of the ten popular languages from Interactive Encyclopedia.

The program successfully captures the abstract of each programming language, as shown in the figure below:

The data are also stored in a local TXT file, which effectively supports subsequent NLP and text mining analysis.

That concludes this part; several common encyclopedia crawling methods have been introduced. I hope you like it.
5. Chapter Summary
Online encyclopedias are widely used in scientific research, knowledge graph and search engine construction, data integration in companies large and small, and Web 2.0 knowledge bases. Their multilingual versions and other characteristics make them popular with researchers and company developers. Common online encyclopedias include Wikipedia, Baidu Encyclopedia, and Interactive Encyclopedia.
This article combines Selenium to crawl paragraph content from Wikipedia, infoboxes from Baidu Encyclopedia, and abstract information from Interactive Encyclopedia, using three different analysis methods. I hope readers can master Selenium-based webpage crawling through the cases in this chapter:
- infobox crawling
- text abstract crawling
- multiple webpage-jump methods
- core code for webpage analysis and crawling
- file saving
Selenium's broadest field of use is automated testing: it runs directly in the browser (such as Firefox, Chrome, or IE), just like a real user, carrying out all kinds of tests on webpages under development, and it is an essential tool for automated testing. I hope readers master this technique's crawling methods, especially for target webpages that require login verification and similar situations.
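For pages behind a login, a crawler typically fills in the form first. A minimal sketch (all URLs, selectors, and credentials here are hypothetical placeholders):

```python
# Fill in a login form before crawling protected pages.
driver.get("https://example.com/login")  # hypothetical login page
driver.find_element_by_name("username").send_keys("your_username")
driver.find_element_by_name("password").send_keys("your_password")
driver.find_element_by_id("login-button").click()
```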
Code download link for this series:
https://github.com/eastmountyxz/Python-zero2one
Finally, for more related blog posts, please see the author's CSDN homepage. Follows and likes are appreciated, and the series will continue to be updated!
Copyright statement: This is an original article by the CSDN blogger "Eastmount", released under the CC 4.0 BY-SA license. Please attach the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/eastmount/article/details/118147562