
2024/05/20 07:51:33 hotcomm 1268


Original link: https://medium.com/neo4j/finding-the-best-tennis-players-of-all-time-using-weighted-pagerank-6950ed5fc98e

The latest release of the Neo4j Graph Algorithms library adds support for a weighted variant of the PageRank algorithm.

My colleague Ryan (https://twitter.com/ryguyrg/) recently pointed me to the paper "Who Is the Best Player Ever? A Complex Network Analysis of the History of Professional Tennis" (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0017249) by Filippo Radicchi, which uses a variant of the PageRank algorithm. As a tennis enthusiast myself, I wondered whether I could do something similar with this algorithm.

I originally planned to scrape the data myself, but Kevin Lin has already done the hard work. He has published all match results up to the end of 2017 as CSV files on GitHub, in the atp-world-tour-tennis-data repository (https://github.com/serve-and-volley/atp-world-tour-tennis-data).

Thanks to Kevin for his contribution.

Data import

Before importing data, we first create some constraints to avoid importing some duplicate data:

CREATE CONSTRAINT ON (p:Player) ASSERT p.id IS UNIQUE;
CREATE CONSTRAINT ON (m:Match) ASSERT m.id IS UNIQUE;

Next we will import the data into Neo4j. First, copy the CSV files created by Kevin to the Neo4j import directory. Once that is done, we can use Cypher's LOAD CSV command to import the data into Neo4j.

LOAD CSV FROM "file:///match_scores_1968-1990_UNINDEXED.csv" AS row
MERGE (winner:Player {id: row[8]})
  ON CREATE SET winner.name = row[7]
MERGE (loser:Player {id: row[11]})
  ON CREATE SET loser.name = row[10]
MERGE (m:Match {id: row[22]})
SET m.score = row[15], m.year = toInteger(split(row[0], "-")[0])
MERGE (m)-[w:WINNER]-(winner) SET w.seed = toInteger(row[13])
MERGE (m)-[l:LOSER]-(loser) SET l.seed = toInteger(row[14]);

LOAD CSV FROM "file:///match_scores_1991-2016_UNINDEXED.csv" AS row
MERGE (winner:Player {id: row[8]})
  ON CREATE SET winner.name = row[7]
MERGE (loser:Player {id: row[11]})
  ON CREATE SET loser.name = row[10]
MERGE (m:Match {id: row[22]})
SET m.score = row[15], m.year = toInteger(split(row[0], "-")[0])
MERGE (m)-[w:WINNER]-(winner) SET w.seed = toInteger(row[13])
MERGE (m)-[l:LOSER]-(loser) SET l.seed = toInteger(row[14]);

LOAD CSV FROM "file:///match_scores_2017_UNINDEXED.csv" AS row
MERGE (winner:Player {id: row[8]})
  ON CREATE SET winner.name = row[7]
MERGE (loser:Player {id: row[11]})
  ON CREATE SET loser.name = row[10]
MERGE (m:Match {id: row[22]})
SET m.score = row[15], m.year = toInteger(split(row[0], "-")[0])
MERGE (m)-[w:WINNER]-(winner) SET w.seed = toInteger(row[13])
MERGE (m)-[l:LOSER]-(loser) SET l.seed = toInteger(row[14]);
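To make the column positions and the deduplicating effect of MERGE easier to follow, here is a minimal Python sketch of the same idea outside the database. It assumes the column layout used in the LOAD CSV statements (row[0] starts with the year, row[7]/row[8] are the winner's name and id, row[10]/row[11] the loser's, row[15] the score, row[22] the match id). The helper make_row, the sample values and the player ids are hypothetical and exist only for illustration; seeds are left out for brevity.

```python
import csv
import io

def load_matches(csv_text):
    """Parse match rows by column position, deduplicating players and matches by id."""
    players = {}  # player id -> name; the first name seen is kept, like ON CREATE SET
    matches = {}  # match id -> record; later rows overwrite, like the plain SET
    for row in csv.reader(io.StringIO(csv_text)):
        winner_id, loser_id, match_id = row[8], row[11], row[22]
        players.setdefault(winner_id, row[7])
        players.setdefault(loser_id, row[10])
        matches[match_id] = {
            "score": row[15],
            "year": int(row[0].split("-")[0]),
            "winner": winner_id,
            "loser": loser_id,
        }
    return players, matches

def make_row(first_col, w_name, w_id, l_name, l_id, score, match_id):
    """Build a hypothetical 23-column row with only the columns used above filled in."""
    row = [""] * 23
    row[0], row[7], row[8] = first_col, w_name, w_id
    row[10], row[11], row[15], row[22] = l_name, l_id, score, match_id
    return ",".join(row)

# Hypothetical sample data: the second row repeats the first one.
sample = "\n".join([
    make_row("2017-580", "Player A", "a001", "Player B", "b002", "63 64", "m1"),
    make_row("2017-580", "Player A", "a001", "Player B", "b002", "63 64", "m1"),
    make_row("2016-404", "Player B", "b002", "Player A", "a001", "76 62", "m2"),
])
players, matches = load_matches(sample)
print(len(players), len(matches))
```

Keying both dictionaries by id plays the role that the uniqueness constraints and MERGE play in Neo4j: loading the same row twice does not create a second player or match.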

The model is very simple. You can run the following query to see a visual description of it:

CALL db.schema()

You can see:

[Image: graph schema]

It looks good. Before continuing, let's write a simple query to take a look at the data:

MATCH p=()-[:LOSER]-()-[r:WINNER]-() RETURN p LIMIT 25

[Image: sample of the imported data]

The player with the most wins

Now we want to find the players with the most wins. How should we write this query?

MATCH (p:Player)
WITH p,
     size((p)-[:WINNER]-()) AS wins,
     size((p)-[:LOSER]-()) AS defeats
RETURN p.name, wins, defeats,
       CASE WHEN wins + defeats = 0 THEN 0
            ELSE (wins * 100.0) / (wins + defeats) END AS percentageWins
ORDER BY wins DESC
LIMIT 10

Run the above statement and you will see the following output:

[Image: query results]

If you are also a tennis fan, you will probably recognize most of the names on this list. Most of them are considered among the best players of all time, but simply counting the number of matches won is not a very rigorous measure.

At this point, we can try a more sophisticated approach: the PageRank algorithm.
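The win-percentage calculation from the query above, including its guard against division by zero, can be sketched in Python as follows. The player names and counts are hypothetical; they are chosen only to show that ordering by raw wins can put a player with a lower win percentage first.

```python
def win_percentage(wins, defeats):
    """Return the share of matches won, or 0 for a player with no matches."""
    total = wins + defeats
    if total == 0:
        return 0
    return wins * 100.0 / total

# Hypothetical (wins, defeats) counts, not taken from the dataset.
records = {"Player A": (30, 10), "Player B": (45, 45), "Player C": (0, 0)}

# Order by number of wins, as the Cypher query does with ORDER BY wins DESC.
ranking = sorted(records.items(), key=lambda item: item[1][0], reverse=True)
for name, (wins, defeats) in ranking:
    print(name, wins, defeats, win_percentage(wins, defeats))
```

In this made-up example Player B comes first with 45 wins despite winning only half of their matches, while Player A wins three quarters of theirs.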

Building a projected credibility graph

PageRank works by determining the credibility of a node from its incoming relationships. On the web, for example, a page lends credibility to another page by linking to it. How much credibility is passed along can be controlled by a weight property on the relationship.

In our tennis world, players pass credibility to one another based on how many times they have lost to each other. For example, the following query shows how many times Federer and Nadal have beaten each other.
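As an illustration of how relationship weights change the way credibility is passed along, here is a textbook-style power-iteration sketch of weighted PageRank in Python. It is not the Neo4j library's implementation, and its scores will not match the library's numerically. The function name, the tiny example graph, and the choice of a damping factor of 0.85 with 20 iterations are assumptions made for this sketch.

```python
def weighted_pagerank(edges, damping=0.85, iterations=20):
    """edges maps (source, target) -> positive weight; returns node -> score."""
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight

    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(scores[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_scores = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            # Each source splits its score in proportion to the edge weights.
            share = weight / out_weight[source]
            new_scores[target] += damping * scores[source] * share
        scores = new_scores
    return scores

# Hypothetical graph: A passes three times as much credibility to B as to C.
edges = {("A", "B"): 3, ("A", "C"): 1, ("B", "A"): 1, ("C", "A"): 1}
scores = weighted_pagerank(edges)
print(sorted(scores, key=scores.get, reverse=True))
```

In the unweighted case every outgoing relationship would receive an equal share; with weights, B receives three quarters of what A passes on and therefore ends up ahead of C.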

MATCH (p1:Player {name: "Roger Federer"}), (p2:Player {name: "Rafael Nadal"})
RETURN p1.name, p2.name,
       size((p1)-[:WINNER]-()-[:LOSER]-(p2)) AS p1Wins,
       size((p1)-[:LOSER]-()-[:WINNER]-(p2)) AS p2Wins

The running output is as follows:

[Image: query results]

Our projected graph should contain a direct relationship between Federer and Nadal in each direction, with a weight that represents the number of times the source player has lost to the target player. The relationship from Federer to Nadal has a weight of 23, meaning that Federer lost to Nadal 23 times. The relationship from Nadal to Federer has a weight of 15.

We can write the following query to project this graph:

MATCH (p1)-[:WINNER]-(match)-[:LOSER]-(p2)
WHERE p1.name IN ["Roger Federer", "Rafael Nadal"]
AND p2.name IN ["Roger Federer", "Rafael Nadal"]
RETURN p2.name AS source, p1.name AS target, count(*) AS weight
LIMIT 10

The output of this query is as follows:

[Image: query results]

Next, we remove the WHERE clause so that the query runs over the entire graph.
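The projection itself is just a grouped count with the loser as source and the winner as target. A minimal Python sketch of that step, using hypothetical match results rather than real head-to-head data, could look like this:

```python
from collections import Counter

def project_losses(matches):
    """matches is an iterable of (winner, loser) pairs.

    Returns a Counter mapping (loser, winner) -> number of matches,
    mirroring 'p2 AS source, p1 AS target, count(*) AS weight'.
    """
    return Counter((loser, winner) for winner, loser in matches)

# Hypothetical results, listed as (winner, loser).
matches = [
    ("Player A", "Player B"),
    ("Player A", "Player B"),
    ("Player B", "Player A"),
    ("Player C", "Player A"),
]
projection = project_losses(matches)
for (source, target), weight in sorted(projection.items()):
    print(source, "->", target, weight)
```

Pointing each relationship from loser to winner means that a player accumulates incoming weight, and therefore PageRank credibility, from every opponent they have beaten.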

Use weighted PageRank to discover the best tennis players

We enable the weighted variant of PageRank by passing the weightProperty parameter. By default, the PageRank algorithm runs in unweighted mode. The following statement runs weighted PageRank over the entire graph:

CALL algo.pageRank.stream(
  "MATCH (p:Player) RETURN id(p) AS id",
  "MATCH (p1)-[:WINNER]-(match)-[:LOSER]-(p2)
   RETURN id(p2) AS source, id(p1) AS target, count(*) AS weight",
  {graph: "cypher", weightProperty: "weight"})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS player, score
ORDER BY score DESC
LIMIT 10

The results are as follows:

[Image: query results]

We can see that the top of our ranking differs from the ranking in Filippo Radicchi's paper. The main difference is that Federer, Nadal and Djokovic all appear in our top five. This makes sense: Radicchi's analysis only extends to 2010, and these three players have performed very well in the years since.

We can change our template to include only matches played up to 2010 with the following query:

CALL algo.pageRank.stream(
  "MATCH (p:Player) RETURN id(p) AS id",
  "MATCH (p1)-[:WINNER]-(match)-[:LOSER]-(p2)
   WHERE match.year <= $year
   RETURN id(p2) AS source, id(p1) AS target, count(*) AS weight",
  {graph: "cypher", weightProperty: "weight", params: {year: 2010}})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS player, score
ORDER BY score DESC
LIMIT 10

The results are as follows:

[Image: query results]

Note that in this query, the year value is passed as a parameter into the Cypher projection query via the params key.

The top two in our ranking are now the same as Radicchi's, but Federer is in third place rather than seventh as in Radicchi's ranking, while Nadal and Djokovic are now outside our top ten.

We can also compute the PageRank ranking for a single year. The following query computes the ranking for 2017:

CALL algo.pageRank.stream(
  "MATCH (p:Player) RETURN id(p) AS id",
  "MATCH (p1)-[:WINNER]-(match)-[:LOSER]-(p2)
   WHERE match.year = $year
   RETURN id(p2) AS source, id(p1) AS target, count(*) AS weight",
  {graph: "cypher", weightProperty: "weight", params: {year: 2017}})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS player, score
ORDER BY score DESC
LIMIT 10

The results are as follows:

[Image: query results]

The picture below shows the 2017 ATP World Tour year-end rankings:

[Image: 2017 ATP World Tour year-end rankings]

This ranking is quite different from ours. Why? The official ranking gives different weight to each match, while our PageRank ranking gives equal weight to every match.

Well, that's all for using weighted PageRank to find the best tennis players of all time. I look forward to seeing more people use weighted PageRank to solve other problems. If you have used it, please let me know at [email protected]

Enjoy!

Translator's note: The author only shows how to use the algorithm from an application perspective and does not explain what each parameter of the algo.pageRank.stream procedure does. I will look for relevant articles and introduce them to you in the future.
