1. Three Online Encyclopedias
With the rapid development of the Internet and big data, much work has emerged around integrating multi-source knowledge bases, building knowledge graphs, and establishing computing engines. Representative knowledge graph applications include Google's Knowledge Graph, Facebook's Graph Search, Baidu's Zhixin, and Sogou's Zhilifang (Knowledge Cube). These applications differ in detail, but they share one thing: they all draw on online encyclopedias such as Wikipedia, Baidu Encyclopedia, and Interactive Encyclopedia during construction. This chapter therefore teaches you how to crawl these three online encyclopedias.
An encyclopedia is a general compendium of knowledge across disciplines such as astronomy, geography, nature, the humanities, religion, belief, and literature. It can be comprehensive, covering content from all fields, or specialized in a single domain. Next, we introduce three common online encyclopedias, which are among the important corpora for information extraction research.
1. Wikipedia
"Wikipedia is a free online encyclopedia with the aim to allow anyone to edit articles." This is the official introduction of Wikipedia. Wikipedia is a multi-language encyclopedia collaboration project based on Wiki technology. The word "Wikipedia" combines its core technology "Wiki" with "encyclopedia", and its articles are created jointly by its contributors.
Among all online encyclopedias, Wikipedia has the highest-quality knowledge and the best structure, but it originally focused on English knowledge and covers relatively little Chinese knowledge. An online encyclopedia page usually includes: Title, Abstract (description), Infobox (message box), Categories (entity categories), Crosslingual Links, and so on. The Chinese Wikipedia page of the entity "Huangguoshu Waterfall" is shown in Figure 1.
Figure 1 shows the following Wikipedia page elements:
- Article Title: the unique identifier of an article (except for disambiguation pages), corresponding to one entity, here "Huangguoshu Waterfall".
- Abstract: describes the entire article or entity in one or two concise sentences, which has important use value.
- Free Text: includes the full text and partial text. The full text is all the textual information describing the article, including the abstract and each section; partial text describes one part of the article, which users can pick out as needed.
- Category Labels: identify the types of the article; as shown in the figure, "Huangguoshu Waterfall" includes "National 5A Tourist Scenic Area", "Waterfalls of China", "Guizhou Tourism", etc.
- Infobox (message box): also known as an information module or information box. It displays web information in a structured form, describing the attributes and attribute values of the article or entity. An infobox contains a number of "attribute-attribute value" pairs that gather the core information of the article and characterize the entire webpage or entity, as sketched below.
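In code, such "attribute-attribute value" pairs map naturally onto a dictionary. The following minimal sketch (the values are abbreviated and hypothetical, based on the figure) shows how the "Huangguoshu Waterfall" infobox could be represented in Python:

```python
# A hypothetical in-memory representation of an infobox:
# each key is an attribute name, each value is the attribute value.
infobox = {
    "Chinese name": "黄果树瀑布",
    "Location": "Anshun, Guizhou, China",
    "Scenic-spot level": "National 5A Tourist Scenic Area",
}

for attribute, value in infobox.items():
    print(attribute, "->", value)
```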

2. Baidu Encyclopedia
Baidu Encyclopedia is an open and free online encyclopedia platform launched by Baidu. As of April 2017, Baidu Encyclopedia had included more than 14.32 million entries, with more than 6.1 million netizens participating in entry editing, covering almost all known fields of knowledge. Baidu Encyclopedia aims to create a Chinese information platform covering knowledge in all fields. It emphasizes user participation and contribution, mobilizing the power of Internet users, pooling their wisdom, and encouraging active exchange and sharing. At the same time, Baidu Encyclopedia is integrated with Baidu Search and Baidu Knows to meet users' information needs at different levels.
Compared with Wikipedia, Baidu Encyclopedia contains the most Chinese knowledge and has the widest coverage, but its accuracy is relatively poor. Baidu Encyclopedia pages likewise include: Title, Abstract (description), Infobox (message box), Categories (entity categories), Crosslingual Links, and so on. Figure 2 shows the Baidu Encyclopedia "Python" webpage. The infobox is the middle part of the page and uses key-value pairs; for example, the value of "foreign name" is "Python", and the value of "classic textbook" is "Head First Python".

3. Interactive Encyclopedia
Interactive Encyclopedia (www.baike.com) is a pioneer and leader among Chinese encyclopedia websites. It is committed to providing hundreds of millions of Chinese users with massive, comprehensive, and timely encyclopedia information, and to continuously improving how users create, acquire, and share information through its new Wiki platform. As of the end of 2016, Interactive Encyclopedia had grown into an encyclopedia website with 16 million entries, 20 million pictures, and 50,000 micro-encyclopedias created by more than 11 million users, with over 20 million users in total.
Compared with Baidu Encyclopedia, Interactive Encyclopedia has higher accuracy and better structure, and its knowledge quality in professional fields is high. Therefore, researchers usually choose Interactive Encyclopedia as one of their main corpora. Figure 3 shows the homepage of Interactive Encyclopedia.
Interactive Encyclopedia stores information in two forms: one is the structured infobox of an entry, and the other is the free text of the entry. Among entry articles, only some contain a structured infobox, but all entries contain free text. The infobox displays entry information in a structured way. A typical infobox is shown in Figure 4, which displays the information of the entry "Python", for example the designer "Guido van Rossum".
The following sections explain how to use Selenium to crawl the three online encyclopedias. The analysis methods differ slightly. For Wikipedia, we first obtain the links of the Group of Twenty (G20) countries from a list page and then analyze and crawl each country page in turn. For Baidu Encyclopedia, we call Selenium to operate the page automatically, enter each entry (here, scenic spots) into the search box, and then locate and crawl the content. For Interactive Encyclopedia, we construct the URL of each entry page directly and then visit the different entries for analysis and capture.
2. Selenium Crawls Baidu Encyclopedia
As the largest Chinese online encyclopedia and knowledge platform, Baidu Encyclopedia provides knowledge of various industries for researchers. Although the accuracy of its entries is not the best, it still offers a good knowledge platform for scholars engaged in data mining, knowledge graphs, natural language processing, big data, and other fields.
1. Webpage Analysis
This section explains in detail an example of Selenium crawling Baidu Encyclopedia. The crawl target is the infobox (message box) information of ten national 5A-level scenic spots. The core steps of the webpage analysis are as follows:
(1) Call Selenium to search Baidu Encyclopedia automatically
First, call Selenium to visit the Baidu Encyclopedia homepage at "https://baike.baidu.com". At the top of the homepage is the search box: enter an entry such as "故宫" (the Forbidden City) and click "进入词条" (Enter the entry) to get its detailed information.
Then, select the "进入词条" button with the mouse in the browser and right-click "Inspect Element" to view the HTML source code corresponding to the button, as shown in Figure 6. Note that the menu item name differs between browsers: the figure uses the 360 Secure Browser, where it is called "审查元素" (Review Element), while Chrome calls it "检查" (Inspect), QQ Browser also calls it "检查", and so on.
The HTML core code corresponding to "进入词条" is shown below:

```html
<div class="...">
  <form id="searchForm" action="/search/word" method="get">
    <input id="query" nslog="normal" name="word" type="text"
           autocomplete="off" autocorrect="off" value="">
    <button id="search" nslog="normal" type="button">进入词条</button>
    <button id="searchLemma" nslog="normal" type="button">全站搜索</button>
    <a class="help" href="/help" nslog="normal">帮助</a>
  </form>
  ...
</div>
```
Calling the following Selenium function obtains the search input control:

```python
driver.find_element_by_xpath("//form[@id='searchForm']/input")
```
Then automatically enter "故宫", obtain the "进入词条" button, and automatically click it to reach the "故宫" page. The core code is shown below:

```python
driver.get("http://baike.baidu.com/")
elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
elem_inp.send_keys(name)
elem_inp.send_keys(Keys.RETURN)
```
(2) Visit the "故宫" page and locate the infobox
After the first step is completed, we enter the "故宫" page, find the infobox section in the middle of the page, right-click it, and click "Inspect Element"; the result is shown in Figure 7.
The infobox mainly stores information in the form of attribute-attribute value pairs. For example, the value of the attribute "中文名" (Chinese name) is "北京故宫" (Beijing Forbidden City), and the value of the attribute "外文名" (foreign name) is "Forbidden City". The corresponding part of the HTML source code is as follows:

```html
<div class="basic-info J-basic-info cmn-clearfix">
  <dl class="basicInfo-block basicInfo-left">
    <dt class="basicInfo-item name">中文名</dt>
    <dd class="basicInfo-item value">北京故宫</dd>
    <dt class="basicInfo-item name">外文名</dt>
    <dd class="basicInfo-item value">Forbidden City</dd>
    <dt class="basicInfo-item name">类别</dt>
    <dd class="basicInfo-item value">世界文化遗产，历史博物馆</dd>
    ...
  </dl>
  <dl class="basicInfo-block basicInfo-right">
    <dt class="basicInfo-item name">占地面积</dt>
    <dd class="basicInfo-item value">约72万平方米</dd>
    ...
  </dl>
</div>
```
The entire infobox is located in the `<div class="basic-info J-basic-info cmn-clearfix">` tag, under which are `<dl>`, `<dt>`, and `<dd>` tags. The infobox div layout includes two `<dl>...</dl>` blocks: one records the left part of the infobox, and the other records the right part. The attributes and attribute values are defined within each `<dl>` tag, as shown in Figure 8.
Note: `<dt>` and `<dd>` must be wrapped in an outer `<dl>`. The `<dl>` tag defines a definition list, the `<dt>` tag defines an item in the list, and the `<dd>` tag describes that item, similar to how `<table>` combines with its child tags.
Then call the find_elements_by_xpath() function of the Selenium package to locate the attribute names and attribute values respectively. The function returns multiple elements, which are then output through a for loop. The code is as follows:

```python
elem_name = driver.find_elements_by_xpath(
    "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dt")
elem_value = driver.find_elements_by_xpath(
    "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dd")
for e in elem_name:
    print(e.text)
for e in elem_value:
    print(e.text)
```
2. Code Implementation
The analysis method for crawling the infoboxes of Baidu Encyclopedia 5A-level scenic spots with Selenium has now been explained. Below are the complete code and some of its difficulties.

Next, we define multiple Python files that call each other to implement the crawler. The complete code includes two files:
- test10_01_baidu.py: defines the main function main() and calls the getinfo.py file
- getinfo.py: crawls the infobox through the getInfobox() function
test10_01_baidu.py
```python
# -*- coding: utf-8 -*-
"""
test10_01_baidu.py: defines the main function main() and calls getinfo.py
By: Eastmount CSDN 2021-06-23
"""
import codecs
import getinfo  # import the crawler module

# main function
def main():
    # read the scenic-spot names from file
    source = open('data.txt', 'r', encoding='utf-8')
    for name in source:
        print(name)
        getinfo.getInfobox(name)
    print('End Read!')
    source.close()

if __name__ == '__main__':
    main()
```
The statement "import getinfo" imports the getinfo.py file as a module. We then call the getInfobox() function in getinfo.py to perform the infobox crawl.
getinfo.py
```python
# coding=utf-8
"""
getinfo.py: crawls the infobox of national 5A-level scenic spots
By: Eastmount CSDN 2021-06-23
"""
import os
import codecs
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# getInfobox function: get the infobox of a national 5A-level scenic spot
def getInfobox(name):
    try:
        # visit Baidu Encyclopedia and search automatically
        driver = webdriver.Firefox()
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        time.sleep(2)  # wait for the entry page to load
        print(driver.current_url)
        print(driver.title)
        # crawl the infobox content
        elem_name = driver.find_elements_by_xpath(
            "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dt")
        elem_value = driver.find_elements_by_xpath(
            "//div[@class='basic-info J-basic-info cmn-clearfix']/dl/dd")
        # pair attribute names with attribute values via a dictionary
        elem_dic = dict(zip(elem_name, elem_value))
        for key in elem_dic:
            print(key.text, elem_dic[key].text)
        time.sleep(5)
        return
    except Exception as e:
        print("Error:", e)
    finally:
        print('\n')
        driver.close()
```
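Note that the code above relies on fixed time.sleep() calls. As a minimal sketch of a more robust alternative (not part of the original code), an explicit wait pauses only until the infobox actually appears:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the infobox container to be present,
# then continue as soon as it appears.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//div[contains(@class, 'basic-info')]")))
```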
During the crawling process, Firefox automatically searches for the "故宫" page, as shown in the figure below:
The crawled content is as follows:

```
https://baike.baidu.com/item/%E5%8C%97%E4%BA%AC%E6%95%85%E5%AE%AB
北京故宫_百度百科
Chinese name          北京故宫
Geographical location No. 4 Jingshan Front Street, Beijing [91]
Opening hours         Apr 1 - Oct 31: 08:20-17:00 (ticketing stops 16:00, last entry 16:10);
                      Nov 1 - Mar 31: 08:30-16:30 (ticketing stops 15:30, last entry 15:40) [91]
Scenic-spot level     AAAAA
Ticket price          60 yuan (peak season) / 40 yuan (off season) [7]
Area                  720,000 square meters (building area about 150,000 square meters)
Protection level      World Cultural Heritage; first batch of national key cultural relics protection units
Approving bodies      UNESCO; the State Council of the People's Republic of China
Number                III-100
Main collections      清明上河图, 乾隆款金瓯永固杯, ... [8]
Official phone        010-85007057 [92]
```
The Python running result is shown below, where the data.txt file includes the following common scenic spots:
- Beijing Palace Museum
- Huangguoshu Waterfall
- Summer Palace
- Badaling Great Wall
- Ming Tombs
- Prince Gong's Mansion (Gongwangfu)
- Beijing Olympic Park
- Huangshan
The code above outputs the attributes and attribute values through a dictionary. The core code is as follows:

```python
elem_dic = dict(zip(elem_name, elem_value))
for key in elem_dic:
    print(key.text, elem_dic[key].text)
```
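If you also want to keep these pairs for later analysis, a minimal sketch (the file name baidu_infobox.txt is a hypothetical choice) is to append them to a local file:

```python
import codecs

# Append each attribute-value pair to a local UTF-8 text file.
with codecs.open("baidu_infobox.txt", "a", "utf-8") as f:
    for key in elem_dic:
        f.write("%s\t%s\r\n" % (key.text, elem_dic[key].text))
```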
At the same time, readers can try calling the local headless browser PhantomJS for crawling. The call method is as follows:

```python
driver = webdriver.PhantomJS(executable_path="C:\\...\\phantomjs.exe")
```
Homework:
The author has shown how to crawl infoboxes; readers are encouraged to practice on other entries. Such corpora will be essential for your subsequent work in text mining and NLP, such as text classification, entity alignment, entity disambiguation, and knowledge graph construction.
3. Selenium Crawls Wikipedia
Online encyclopedias hold some of the largest amounts of data on the Internet, and these data have a certain structure, i.e., they are semi-structured data. This section and the next cover Wikipedia and Interactive Encyclopedia. First, the author introduces an example of crawling Wikipedia.
1. Webpage Analysis
In this first example, the author explains in detail how Selenium crawls the first paragraph of the abstract for each of the Group of Twenty (G20) countries. The specific steps are as follows:
(1) Get the hyperlinks from the G20 list page
The G20 list page is shown below. Wikipedia sorts the countries alphabetically by their English names, such as "Japan", "Italy", "Brazil", etc. Each country jumps to its page through a hyperlink.

https://en.wikipedia.org/wiki/Category:G20_nations
We need to obtain the hyperlinks of the 20 countries and then crawl each country's page. Select a country's hyperlink, such as "China", right-click it, and click "Inspect" to get the corresponding HTML source code, as shown below.
The hyperlinks are located under the `<ul><li><a>` nodes of the "mw-category-group" layout. The corresponding code:

```html
<div class="mw-pages">
  <div lang="en" dir="ltr" class="mw-content-ltr">
    <div class="mw-category">
      <div class="mw-category-group">
        <h3>C</h3>
        <ul>
          <li><a href="/wiki/China" title="China">China</a></li>
        </ul>
      </div>
      <div class="mw-category-group">...</div>
      <div class="mw-category-group">...</div>
      ...
    </div>
  </div>
</div>
```
Call Selenium's find_elements_by_xpath() function to get the hyperlinks whose node class attribute is "mw-category-group"; it returns multiple elements. The core code for locating the hyperlinks is as follows:

```python
driver.get("https://en.wikipedia.org/wiki/Category:G20_nations")
elem = driver.find_elements_by_xpath("//div[@class='mw-category-group']/ul/li/a")
for e in elem:
    print(e.text)
    print(e.get_attribute("href"))
```
The find_elements_by_xpath() function first analyzes the DOM tree structure of the HTML, locates the specified nodes, and gets their elements. A for loop then obtains each node's text content and href attribute in turn. Here e.text represents the text content of a node; for example, the content between the tags of the node below is "China":

```html
<a href="/wiki/China" title="China">China</a>
```
At the same time, e.get_attribute("href") returns the value of the href attribute, namely "/wiki/China". Similarly, e.get_attribute("title") obtains the title attribute, whose value is "China".
At this point the names and URLs are stored in variables as shown below; we can then visit each country's page and obtain the required content.
(2) Call Selenium to locate and crawl each country's page
As shown in the figure, a country page contains the URL, title, abstract, content, infobox, and so on. The infobox is on the right side of the page and includes the country's full name, location, and other facts.
The infobox describes the entity through attribute-attribute value pairs; it summarizes a web entity very concisely and precisely, such as capital - Beijing, population - 1.3 billion. Generally, after obtaining this information, you need a pre-processing step before you can analyze the data; later chapters explain this in detail. A minimal cleanup sketch follows.
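For instance, crawled attribute values often carry citation markers such as "[91]" and irregular whitespace. The helper below is a hypothetical sketch (not part of the original code) that strips them with regular expressions:

```python
import re

def clean_value(text):
    """Remove citation markers like [91] and collapse whitespace."""
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_value("No. 4 Jingshan Front Street [91]"))
# -> "No. 4 Jingshan Front Street"
```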
After visiting each country's page, we need to obtain the first paragraph of its introduction. The content crawled in this section may be relatively simple, but the method explained is important, including how to locate nodes and how to crawl the knowledge. The corresponding HTML core code of the detail page is as follows:

```html
<div class="mw-parser-output">
  <div role="note" class="hatnote navigation-not-searchable">...</div>
  <table class="infobox geography vcard">...</table>
  <p><b>China</b>, officially the <b>People's Republic of China</b>, ...</p>
  <p>...</p>
  <p>...</p>
  ...
</div>
```
The browser's Inspect Element view is shown in the figure.
The content is located under the `<div>` node whose class attribute is "mw-parser-output". In HTML, the `<p>` tag represents a paragraph and is usually used to hold body text, and the `<b>` tag means bold. To get the first paragraph of content, locate the first `<p>` node containing text. The core code is as follows:
```python
driver.get("https://en.wikipedia.org/wiki/China")
elem = driver.find_element_by_xpath("//div[@class='mw-parser-output']/p[2]").text
print(elem)
```
Note that the first paragraph of body text is located in the second `<p>` node, hence p[2]. At the same time, if readers want to obtain the infobox from the source code, they need to locate it and capture its data. The infobox corresponds to the following node in HTML, which records the core information of the web entity:
```html
<table class="infobox geography vcard">...</table>
```
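As a minimal sketch (not part of the original code, and assuming `driver` is the Firefox session opened above), the rows of that table can be located through an XPath on the class attribute and read one by one:

```python
# Iterate over the infobox rows; many rows hold a label and a value.
rows = driver.find_elements_by_xpath("//table[contains(@class, 'infobox')]//tr")
for row in rows:
    if row.text:
        print(row.text)
```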
2. Code Implementation
The complete code is in the file test10_02.py, as shown below:

```python
# coding=utf-8
# By: Eastmount CSDN 2021-06-23
import time
import re
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org/wiki/Category:G20_nations")
elem = driver.find_elements_by_xpath("//div[@class='mw-category-group']/ul/li/a")

name = []   # country names
urls = []   # country hyperlinks

# crawl the hyperlinks
for e in elem:
    print(e.text)
    print(e.get_attribute("href"))
    name.append(e.text)
    urls.append(e.get_attribute("href"))
print(name)
print(urls)

# crawl the first paragraph of each country page
for url in urls:
    driver.get(url)
    elem = driver.find_element_by_xpath("//div[@class='mw-parser-output']/p[2]").text
    print(elem)
```

The running result is shown in the figure.
PS: You can simply try this part yourself; it is also recommended to crawl Baidu Encyclopedia, Interactive Encyclopedia, and Sogou Encyclopedia.
4. Selenium Crawls Interactive Encyclopedia
Time has passed, and Interactive Encyclopedia has since become the "Quick Encyclopedia" (快懂百科), but fortunately the webpage structure has not changed.

1. Webpage Analysis
Online encyclopedias have become important sources for semantic analysis, knowledge graph construction, natural language processing, search engines, and artificial intelligence. As one of the most popular online encyclopedias, Interactive Encyclopedia provides researchers with strong corpus support.
This section explains how to crawl the abstract information of the pages of the ten most popular programming languages from Interactive Encyclopedia. Through this example, readers will deepen their grasp of using Selenium crawlers to analyze and capture network data. Unlike Wikipedia, where we first crawl a list page and then the required information, and Baidu Encyclopedia, where we input an entry into the search box and then crawl the page, here we jump directly to each entry's detail page and crawl the information.
Because Interactive Encyclopedia's search for different entries follows a fixed rule, namely "common URL + entry name", we construct each entry's webpage through this method. The specific steps are as follows:
(1) Call Selenium to analyze the URL and search Interactive Encyclopedia
We first analyze the URL rules of Interactive Encyclopedia. For example, searching for the entry "贵州" (Guizhou) gives the URL:

http://www.baike.com/wiki/贵州
The corresponding page is shown in the figure. It shows the hyperlink URL at the top, the entry "贵州", the first paragraph of its abstract, and, on the right side, the corresponding pictures and other information.
From this we get a simple rule, namely:

http://www.baike.com/wiki/entry

which can be used to search for the corresponding knowledge. For example, the URL for the programming language "Java" is:

http://www.baike.com/wiki/Java
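Note that for Chinese entry names like "贵州", it is safer to percent-encode the name when building the URL programmatically. A minimal sketch:

```python
from urllib.parse import quote

base = "http://www.baike.com/wiki/"
url = base + quote("贵州")  # percent-encode non-ASCII entry names
print(url)
```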
(2) Visit the Top 10 popular programming languages and crawl their abstracts
In 2016, GitHub ranked languages by the number of pull requests submitted over the previous 12 months; the resulting Top 10 most popular programming languages were JavaScript, Java, Python, Ruby, PHP, C++, CSS, C#, C, and Go.

We then need to obtain the abstract information of these ten languages in turn. Select the abstract part in the browser, right-click it, and click "Inspect Element"; the result is shown in the figure. The HTML source code corresponding to the abstract appears at the bottom.

The content in the new version, the "Quick Encyclopedia", is shown in the figure below:

The HTML core code corresponding to the abstract of the "Java" entry is shown below:

```html
<div class="summary">
  <div class="content-p">
    <span>Java is an object-oriented </span>
    <a href="/wikiid/76015795959865866248?from=wiki_content" class=""
       clicklog="baike_search_clink">
      <span>programming language</span>
    </a>
    <span>. It not only absorbs the various advantages of the C++ language, but also
    abandons concepts that are hard to understand in C++, such as multiple inheritance
    and pointers; the Java language is therefore both powerful and simple to use. As a
    representative of static object-oriented languages, Java is an excellent realization
    of object-oriented theory, allowing programmers to carry out complex programming in
    an elegant way of thinking.</span>
  </div>
  <div class="content-p">
    <span>Java is simple, distributed, ...</span>...
  </div>
</div>
```
Calling Selenium's find_element_by_xpath() function obtains the abstract paragraph. The core code is as follows:

```python
driver = webdriver.Firefox()
url = "http://www.baike.com/wiki/" + name
driver.get(url)
elem = driver.find_element_by_xpath("//div[@class='summary']/div/span")
print(elem.text)
```
The basic steps are:
- First call the webdriver.Firefox() driver to open the Firefox browser.
- Analyze the webpage hyperlink and call the driver.get(url) function to visit it.
- Analyze the webpage DOM tree structure and call driver.find_element_by_xpath() to extract the content.
- Output the result; for some websites the content needs to be stored locally, and unnecessary content needs to be filtered out.
Below are the complete code and a detailed explanation.
2. Code Implementation
The complete code is in blog10_03.py, shown below. The main function main() calls the getAbstract() function to crawl the abstracts of the Top 10 programming languages.

```python
# coding=utf-8
# By: Eastmount CSDN 2021-06-23
import os
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()

# getAbstract function: get the abstract of an entry
def getAbstract(name):
    try:
        # create the output folder and file
        basePathDirectory = "Hudong_Coding"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "HudongSpider.txt")
        # create the file if it does not exist, otherwise append to it
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')

        url = "http://www.baike.com/wiki/" + name
        print(url)
        driver.get(url)
        elem = driver.find_elements_by_xpath("//div[@class='summary']/div/span")
        content = ""
        for e in elem:
            content += e.text
        print(content)
        info.writelines(content + '\r\n')
    except Exception as e:
        print("Error:", e)
    finally:
        print('\n')
        info.write('\r\n')

# main function
def main():
    languages = ["JavaScript", "Java", "Python", "Ruby", "PHP",
                 "C++", "CSS", "C#", "C", "Go"]
    print('Start crawling')
    for lg in languages:
        print(lg)
        getAbstract(lg)
    print('End crawling')

if __name__ == '__main__':
    main()
```
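One caveat worth noting: entry names such as "C#" contain characters with special meaning in URLs (the '#' starts a URL fragment and is silently dropped), so in practice they may need percent-encoding. A minimal sketch reusing the standard library:

```python
from urllib.parse import quote

name = "C#"
# Without quoting, '#' would be treated as a fragment and never reach the server.
url = "http://www.baike.com/wiki/" + quote(name)
print(url)  # http://www.baike.com/wiki/C%23
```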
The crawl results for the "JavaScript" and "Java" programming languages are shown below; this code crawls the abstract information of the ten popular languages from Interactive Encyclopedia.

The program successfully captures the abstract of each programming language, as shown in the figure below:

The data are also stored in a local TXT file, which effectively supports subsequent NLP and text mining analysis.

That concludes this part; several common encyclopedia crawling methods have been introduced. I hope you like it.
5. Chapter Summary
Online encyclopedias are widely used in scientific research, knowledge graph and search engine construction, data integration in companies large and small, and Web 2.0 knowledge bases. Their multilingual versions and other characteristics make them popular with researchers and company developers. Common online encyclopedias include Wikipedia, Baidu Encyclopedia, and Interactive Encyclopedia.
This article combines Selenium to crawl paragraph content from Wikipedia, infoboxes from Baidu Encyclopedia, and abstract information from Interactive Encyclopedia, using three different analysis methods. I hope readers can master Selenium-based webpage crawling through the cases in this chapter:
- infobox crawling
- text abstract crawling
- multiple webpage-jump methods
- core code for webpage analysis and crawling
- file saving
Selenium's broadest field of use is automated testing: it runs directly in the browser (such as Firefox, Chrome, or IE), just like a real user, carrying out all kinds of tests on webpages under development, and it is an essential tool for automated testing. I hope readers master this technique's crawling methods, especially for target webpages that require login verification and similar situations.
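For pages behind a login, a crawler typically fills in the form first. A minimal sketch (all URLs, selectors, and credentials here are hypothetical placeholders):

```python
# Fill in a login form before crawling protected pages.
driver.get("https://example.com/login")  # hypothetical login page
driver.find_element_by_name("username").send_keys("your_username")
driver.find_element_by_name("password").send_keys("your_password")
driver.find_element_by_id("login-button").click()
```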
Code download link for this series:
https://github.com/eastmountyxz/Python-zero2one
Finally, for more related blog posts, please see the author's CSDN homepage. Follows and likes are appreciated, and the series will continue to be updated!
Copyright statement: This is an original article by the CSDN blogger "Eastmount", released under the CC 4.0 BY-SA license. Please attach the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/eastmount/article/details/118147562