
Many organizations are trying to collect and leverage as much data as possible to improve how they run their business, increase revenue, or have a greater impact on the world around them. As a result, it is increasingly common for data scientists to be faced with data sets of 50GB or even 500GB in size.

These datasets, however, are not that comfortable to work with. They are small enough to fit on the hard drive of an everyday laptop, yet far too big to fit in RAM. That makes them difficult even to open and inspect, let alone explore or analyze.

There are generally three strategies for working with such datasets. The first is to subsample the data. Its drawback is obvious: crucial parts may be missed or, worse, not seeing the full picture may distort the data and the story it tells. Another strategy is to use distributed computing. While this is an effective approach in some cases, it comes with significant overhead for managing and maintaining a cluster. Imagine having to set up a cluster for a dataset that is just outside of RAM range, say in the 30-50GB region. To me, this seems like overkill. Alternatively, you can rent a single powerful cloud instance with enough memory to process the data; for example, AWS offers instances with terabytes of RAM. In that case, you still need to manage cloud data buckets, wait for the data to transfer from the bucket to the instance every time the instance starts, handle the compliance issues that come with putting data on the cloud, and put up with all the inconvenience of working on a remote machine. Not to mention the cost, which, although low to start, tends to add up over time.

In this article, I'll show you a new approach: a faster, safer, and altogether more convenient way to do data science with data of almost arbitrary size, as long as it fits on the hard drive of your laptop, desktop, or server.

Vaex


Vaex is an open-source DataFrame library that enables visualization, exploration, analysis, and even machine learning on tabular datasets as large as your hard drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms, and lazy evaluation. All of this is wrapped in a familiar, pandas-like API, so anyone can start using it right away.

The Billion Taxi Rides Analysis

To illustrate these concepts, let's do a simple exploratory data analysis on a dataset far too large to fit into the RAM of a typical laptop. In this article we will use the New York City (NYC) Taxi dataset, which contains information on over 1 billion trips taken by the iconic yellow taxis between 2009 and 2015. The data can be downloaded from this website and is provided in CSV format. The complete analysis can be viewed separately in this Jupyter notebook.

Cleaning up the streets

The first step is to convert the data into a memory-mappable file format, such as Apache Arrow, Apache Parquet, or HDF5. An example of converting CSV data to HDF5 can be found here. Once the data is in a memory-mappable format, opening it with Vaex is instant (0.052 seconds!), even though the data on disk exceeds 100GB:
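A minimal sketch of the conversion and opening steps (the file names here are illustrative):

import vaex

# one-time conversion: from_csv can stream a huge CSV into HDF5 in chunks
# df = vaex.from_csv('yellow_taxi.csv', convert='yellow_taxi.hdf5', chunk_size=5_000_000)

df = vaex.open('yellow_taxi.hdf5')  # memory-mapped: only metadata is read
print(df)  # displaying a DataFrame reads just the first and last 5 rows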

[Figure: Opening a memory-mapped file with Vaex takes only 0.052 seconds, even when it exceeds 100 GB]

Why is it so fast? When you open a memory-mapped file with Vaex, no data is actually read. Vaex reads only the file metadata: the location of the data on disk, the structure of the data (number of rows and columns, column names and types), the file description, and so on. So what if we want to inspect the data or interact with it? Opening a dataset results in a standard DataFrame, and inspecting it is just as fast:

[Figure: A preview of the New York City yellow taxi data]

Notice once more that the cell execution time is negligible. This is because displaying a Vaex DataFrame or column requires reading only the first and last 5 rows from disk. This points to another important property: Vaex traverses the entire dataset only when necessary, and it does so with as few passes over the data as possible.

Anyway, let's begin by cleansing this dataset of extreme outliers and erroneous data entries. A good way to start is to get a high-level overview of the data using the describe method, which shows the number of samples, the number of missing values, and the data type for each column. If a column's data type is numerical, the mean, standard deviation, and minimum and maximum values are also shown. All of these statistics are computed in a single pass over the data.
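Getting that overview is a single call:

df.describe()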

[Figure: A high-level overview of the DataFrame produced by the describe method. The DataFrame contains 18 columns, of which only the first 7 are visible in the screenshot.]

The describe method nicely illustrates the power and efficiency of Vaex: all of these statistics were computed on my MacBook Pro (15-inch, 2018, 2.6GHz Intel Core i7, 32GB RAM), while other libraries or methods would require distributed computing or a 100GB+ cloud instance to perform the same computations. With Vaex, all you need is the data and a laptop with just a few GB of RAM to spare.

Looking at the output of describe, it is easy to notice that the data contains some serious outliers. First, let's check the pick-up locations and remove the extremes. An easy way is to simply plot the pick-up and drop-off locations and visually define the area of New York City we want to focus the analysis on. Since the dataset is so large, histograms are the most effective visualization. Creating and displaying histograms and heatmaps with Vaex is so fast that such plots can be made interactive:

df.plot_widget(df.pickup_longitude,
               df.pickup_latitude,
               shape=512,
               limits='minmax',
               f='log1p',
               colormap='plasma')

[Figure: Interactive heatmap of taxi pick-up locations across New York City]

Once we have interactively decided which area of New York City we want to focus on, we can simply create a filtered DataFrame:

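A sketch of such a filter (the bounding-box coordinates below are illustrative, not the exact values from the original notebook):

# approximate bounding box around New York City
long_min, long_max = -74.05, -73.75
lat_min, lat_max = 40.58, 40.90

df_filtered = df[(df.pickup_longitude > long_min) & (df.pickup_longitude < long_max) &
                 (df.pickup_latitude > lat_min) & (df.pickup_latitude < lat_max) &
                 (df.dropoff_longitude > long_min) & (df.dropoff_longitude < long_max) &
                 (df.dropoff_latitude > lat_min) & (df.dropoff_latitude < lat_max)]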

The coolest part about the code block above is that it requires a negligible amount of memory! Filtering a Vaex DataFrame does not copy the data; it only creates a reference to the original object, with a binary mask applied to it. The mask selects which rows are displayed and used in subsequent calculations. That saves us around 100GB of RAM.

Now, let's examine the passenger_count column. The maximum number of passengers recorded in a single taxi trip is 255, which seems a little extreme. Let's count the number of trips per number of passengers, which is easy to do with the value_counts method:
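In Vaex this is a one-liner on the column expression:

df_filtered.passenger_count.value_counts()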

[Figure: Applying the value_counts method to over 1 billion rows takes only about 20 seconds!]

From the output above we can see that trips with more than 6 passengers are likely rare outliers or simply erroneous data entries. There is also a large number of trips with 0 passengers. Since we cannot tell whether these trips are legitimate, let's filter them out as well.
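A sketch of that filter:

# keep trips with between 1 and 6 passengers
df_filtered = df_filtered[(df_filtered.passenger_count > 0) &
                          (df_filtered.passenger_count < 7)]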

Let's do a similar exercise with the trip_distance column. Since this is a continuous variable, we can plot the distribution of trip distances. Guided by the minimum and maximum distances, we draw a histogram over a more sensible range:
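A sketch using Vaex's 1-D plotting helper (the range in miles is an assumption for illustration):

df_filtered.plot1d(df_filtered.trip_distance, shape=128, limits=[0, 100])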

[Figure: Histogram of trip distances in the New York taxi dataset]

From the chart above we can see that the number of trips decreases as the distance increases. At a distance of about 100 miles, there is a large drop in the distribution. For now, we will use that as a cut-off point to eliminate extreme outliers based on trip distance:
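A sketch:

df_filtered = df_filtered[df_filtered.trip_distance < 100]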

The presence of extreme outliers in the trip-distance column motivates a look at taxi trip durations and average speeds. These features are not readily available in the dataset, but they are simple to calculate:

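A sketch of the two virtual columns (the datetime column names pickup_datetime and dropoff_datetime are assumptions; adjust them to your copy of the data):

# trip duration in minutes: a virtual column, so no memory is allocated
df_filtered['trip_duration_min'] = (df_filtered.dropoff_datetime -
                                    df_filtered.pickup_datetime).astype('timedelta64[m]').astype('float64')

# average speed in miles per hour
df_filtered['trip_speed_mph'] = df_filtered.trip_distance / (df_filtered.trip_duration_min / 60)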

The code block above requires zero memory and no time to execute! That is because it creates virtual columns. These columns hold only the mathematical expressions that define them and are evaluated only when needed; otherwise, virtual columns behave just like regular columns. Note that other standard libraries would require tens of gigabytes of RAM for the same operation.

OK, let's plot the distribution of trip durations:

[Figure: Histogram of the durations of over 1 billion taxi trips in New York City]

From the graph above we can see that 95% of taxi trips take less than 30 minutes to reach their destination, although some journeys can take 4 to 5 hours. Can you imagine being stuck in a taxi in New York City for over 3 hours? Anyway, let's be open-minded and consider all trips that lasted less than 3 hours in total:
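A sketch of the duration filter (3 hours = 180 minutes):

df_filtered = df_filtered[df_filtered.trip_duration_min < 180]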

Now let's investigate the average speed of the taxis, again choosing a sensible range for the data:

[Figure: Distribution of average taxi speeds]

From the plot above we can infer that average taxi speeds sensibly range between 1 and 60 miles per hour, so we can update our filtered DataFrame:
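A sketch:

df_filtered = df_filtered[(df_filtered.trip_speed_mph > 1) &
                          (df_filtered.trip_speed_mph < 60)]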

Let's shift our attention to the cost of a taxi trip. From the output of the describe method, we can see that there are some crazy outliers in the fare_amount, total_amount, and tip_amount columns. For starters, no values in these columns should be negative. At the other extreme, the numbers suggest that some lucky drivers nearly became millionaires off a single ride. Let's look at the distributions of these quantities, restricted to a relatively sensible range:

[Figure: Distributions of the fare, total, and tip amounts for over 1 billion taxi rides in New York City. Creating these plots took only 31 seconds on a laptop!]

We see that all three distributions have rather long tails. Some values in the tails may be legitimate, while others may be erroneous data entries. In any case, let's be conservative for now and only consider rides with fare_amount, total_amount, and tip_amount below $200. We also require the fare_amount and total_amount values to be greater than $0.
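A sketch of those limits:

df_filtered = df_filtered[(df_filtered.fare_amount > 0) & (df_filtered.fare_amount < 200) &
                          (df_filtered.total_amount > 0) & (df_filtered.total_amount < 200) &
                          (df_filtered.tip_amount < 200)]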

Finally, after this initial pass of data cleaning, let's see how many taxi trips are left for our analysis:
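A sketch (len works on a filtered DataFrame without materializing anything):

print(f'Number of trips remaining: {len(df_filtered):,}')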

We are left with over 1.1 billion trips! That is plenty of data to extract some valuable insights about taxi travel.

Get into the driver’s seat

Let's say we are a prospective taxi driver, or the manager of a taxi company, and are interested in using this dataset to learn how to maximize profits, minimize costs, or simply improve our work life.

Let's start by identifying the locations that, on average, yield the best earnings for picking up and dropping off passengers. Naively, we could simply plot a heatmap of the pick-up and drop-off locations, colour-coded by the average fare amount, and look at the hotspots. However, taxi drivers have costs of their own, such as fuel. Therefore, taking passengers on a long ride may result in a higher fare, but it also means greater fuel consumption and lost time. In addition, it may not be easy to find a passenger to take from a remote drop-off location back to the city centre, and driving back without a passenger may prove costly. One way to account for this is to colour-code the heatmap by the average of the ratio between the fare and the trip distance. Let's consider both approaches:
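A sketch of the two heatmaps via plot_widget's what argument (using a virtual column for the ratio):

df_filtered['fare_per_mile'] = df_filtered.fare_amount / df_filtered.trip_distance

df_filtered.plot_widget(df_filtered.pickup_longitude, df_filtered.pickup_latitude,
                        what='mean(fare_amount)', shape=512, colormap='plasma')
df_filtered.plot_widget(df_filtered.pickup_longitude, df_filtered.pickup_latitude,
                        what='mean(fare_per_mile)', shape=512, colormap='plasma')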

[Figure: Heatmaps of New York City, colour-coded by the average fare amount (left) and by the average ratio of fare amount to trip distance (right)]

In the naive case, when we only care about getting the maximum fare for the service provided, the best areas for picking up passengers are the New York airports and the major thoroughfares, such as the Van Wyck Expressway and the Long Island Expressway. Once we take the distance travelled into account, the picture changes slightly. The Van Wyck Expressway, the Long Island Expressway, and the airports remain good places to pick up and drop off passengers, but they stand out much less on the map. Meanwhile, some new hotspots light up on the west side of the Hudson River that appear to be quite profitable.

Driving a taxi is a very flexible job. To make the most of that flexibility, it helps to know when driving is most profitable. To answer this question, let's plot the average ratio of fare to trip distance for each day of the week and each hour of the day:
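A sketch using Vaex's binned aggregations (day and hour are derived via the dt accessor; the fare_per_mile virtual column was defined above):

# day of week and hour of day as virtual columns
df_filtered['pickup_day'] = df_filtered.pickup_datetime.dt.dayofweek
df_filtered['pickup_hour'] = df_filtered.pickup_datetime.dt.hour

# mean fare-per-mile on a 7 x 24 grid, computed in a single pass over the data
grid = df_filtered.mean(df_filtered.fare_per_mile,
                        binby=[df_filtered.pickup_day, df_filtered.pickup_hour],
                        limits=[[0, 7], [0, 24]],
                        shape=(7, 24))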

[Figure: The average ratio of fare to trip distance for each day of the week and each hour of the day]

The numbers above make sense: the best earnings occur during rush hour, especially around midday on weekdays. As taxi drivers, a fraction of our income goes to the taxi company, so we might also be interested in which days and times passengers tip the most. Let's produce a similar plot, this time showing the average tip percentage:
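A sketch, assuming the tip percentage is defined as the tip amount over the total amount:

df_filtered['tip_percentage'] = df_filtered.tip_amount / df_filtered.total_amount * 100

tip_grid = df_filtered.mean(df_filtered.tip_percentage,
                            binby=[df_filtered.pickup_day, df_filtered.pickup_hour],
                            limits=[[0, 7], [0, 24]],
                            shape=(7, 24))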

[Figure: The average tip percentage for each day of the week and each hour of the day]

The plot above is interesting. It tells us that passengers tip their taxi drivers best between 7 and 10 in the morning, and in the evenings during the early part of the week. Don't expect large tips if you pick up passengers at 3 or 4 a.m. Combining the insights from the last two plots, 8 to 10 a.m. is a good time to work: one can pick up both a good fare per mile and a good tip.

Start the engine!

Earlier in this article we briefly looked at the trip_distance column, and while cleansing it of extreme outliers, we kept all trips shorter than 100 miles. That is still a rather large cut-off, especially given that the yellow taxi companies operate primarily in Manhattan. The trip_distance column describes the distance the taxi travelled between picking up and dropping off a passenger. However, one can often choose between routes of different lengths, for example to avoid traffic jams or road works. So, as a counterpart to the trip_distance column, let's compute the shortest possible distance between the pick-up and drop-off locations, which we'll call arc_distance:

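A sketch of a haversine-style great-circle calculation written with numpy operations (the exact formula in the original notebook may differ slightly):

import numpy as np

def arc_distance(theta_1, phi_1, theta_2, phi_2):
    # great-circle distance between two (latitude, longitude) pairs in degrees, in miles
    temp = (np.sin((theta_2 - theta_1) * np.pi / 180 / 2) ** 2
            + np.cos(theta_1 * np.pi / 180) * np.cos(theta_2 * np.pi / 180)
            * np.sin((phi_2 - phi_1) * np.pi / 180 / 2) ** 2)
    return 2 * np.arctan2(np.sqrt(temp), np.sqrt(1 - temp)) * 3958.8  # mean Earth radius in miles

# numpy ufuncs applied to Vaex expressions stay lazy, so this is another virtual column
df_filtered['arc_distance'] = arc_distance(df_filtered.pickup_latitude, df_filtered.pickup_longitude,
                                           df_filtered.dropoff_latitude, df_filtered.dropoff_longitude)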

For complex expressions written with numpy, Vaex can use just-in-time compilation via Numba, Pythran, or even CUDA (if you have an NVIDIA GPU) to greatly speed up the computation.

The arc-distance formula is mathematically involved and full of trigonometry, so the computation is expensive, especially on a large dataset. If an expression or function is written using only Python operations and methods from the numpy package, Vaex computes it in parallel, using all the cores of your machine. On top of that, Vaex supports just-in-time compilation via Numba (using LLVM) or Pythran (acceleration via C++), which yields better performance. If you happen to have an NVIDIA graphics card, you can use CUDA via the jit_cuda method for even better performance.
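For instance, a sketch of JIT-compiling the virtual column defined above:

# replace the expression with a Numba-compiled version of itself
df_filtered['arc_distance_jit'] = df_filtered.arc_distance.jit_numba()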

Anyway, let's plot the distribution of trip_distance and arc_distance:

[Figure: Left: comparison of trip_distance and arc_distance; right: the distribution of arc_distance]

Interestingly, arc_distance never exceeds 21 miles, while the distance the taxi actually travelled can be up to 5 times larger. In fact, there are millions of taxi trips where the drop-off location is within 100 meters (0.06 miles) of the pick-up location!

Yellow Taxi Company through the Years

The dataset we are using spans 7 years, and it is interesting to see how some quantities evolved over that time. With Vaex, we can quickly perform out-of-core group-by and aggregation operations. Let's explore how fares and trip distances evolved over those 7 years:
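A sketch of the group-by, showing a subset of the 8 aggregations (vaex.agg supplies the aggregators; the year is derived via the dt accessor):

import vaex.agg

df_filtered['pickup_year'] = df_filtered.pickup_datetime.dt.year

df_years = df_filtered.groupby(by='pickup_year', agg={
    'fare_mean': vaex.agg.mean('fare_amount'),
    'total_mean': vaex.agg.mean('total_amount'),
    'tip_mean': vaex.agg.mean('tip_percentage'),         # virtual column
    'distance_mean': vaex.agg.mean('trip_distance'),
    'arc_distance_mean': vaex.agg.mean('arc_distance'),  # virtual column
})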

[Figure: For a Vaex DataFrame with over 1 billion samples, a group-by with 8 aggregations takes less than 2 minutes on a laptop with a quad-core processor]

In the cell above we perform a group-by operation followed by 8 aggregations, 2 of which are on virtual columns. The cell executes in under 2 minutes on my laptop, which is quite impressive given that the data contains over 1 billion samples. Anyway, let's look at the results. Here is how the cost of a taxi ride evolved over the years:

[Figure: The average fare and total amounts, and the tip percentage passengers paid, per year]

We see that both taxi fares and tips increase over the years. Now let's look at the mean trip_distance and arc_distance as a function of year:

[Figure: The average trip distance and arc distance travelled per year]

The chart above shows a small increase in both the trip distance and the arc distance, meaning that, on average, people travel slightly farther with each passing year.

Show me the money

Before our journey comes to an end, let's make one more stop and investigate how passengers pay for their rides. The dataset contains a payment_type column, so let's see what values it holds:


From the dataset documentation, we can see that there are only 6 valid entries for this column:

1 = credit card payment
2 = cash payment
3 = no charge
4 = dispute
5 = unknown
6 = voided trip

So we can simply map the entries in the payment_type column to integers:
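A sketch using the expression map method (the raw string labels shown here are illustrative; your copy of the data may contain more variants, all of which need to appear in the mapping):

# map the heterogeneous raw labels onto the 6 canonical integer codes
payment_mapping = {'CRD': 1, 'CREDIT': 1, 'CSH': 2, 'CASH': 2,
                   'NOC': 3, 'NO CHARGE': 3, 'DIS': 4, 'DISPUTE': 4,
                   'UNK': 5, 'VOIDED': 6}
df_filtered['payment_type_int'] = df_filtered.payment_type.map(payment_mapping)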

Now we can group the data by year to see how the taxi payment habits of New Yorkers have changed:
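A sketch of the group-by, handing the small aggregated result over to pandas for plotting:

df_payments = df_filtered.groupby(by=['pickup_year', 'payment_type_int'],
                                  agg={'trips': vaex.agg.count()})
pdf = df_payments.to_pandas_df()  # tiny now: a few rows per year and payment type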

[Figure: Payment methods, year by year]

We see that over time, card payments slowly became more frequent than cash payments. We truly live in a digital age! Note that in the code block above, once the data is aggregated, the small resulting Vaex DataFrame can easily be converted to a pandas DataFrame, which in turn can be conveniently passed to Seaborn. No need to reinvent the wheel here.

Finally, let's see whether the payment method depends on the time of day or the day of the week by plotting the ratio between cash and card payments. To do this, we first create two filters that select only the rides paid with cash or with a card. The next step uses one of my favourite Vaex features: aggregations with selections. Other libraries require a separate aggregation for each individually filtered DataFrame, with the results later combined into one. With Vaex, we can do it in one step, by providing the selections directly to the aggregation function. This is very convenient, requires just a single pass over the data, and gives better performance. After that, we just plot the resulting DataFrame in the standard way:
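A sketch of aggregation with selections (both counts are computed in a single pass; pickup_day and pickup_hour were defined earlier):

# two selections evaluated together; the result has shape (2, 7, 24)
counts = df_filtered.count(binby=[df_filtered.pickup_day, df_filtered.pickup_hour],
                           selection=['payment_type_int == 2', 'payment_type_int == 1'],
                           limits=[[0, 7], [0, 24]],
                           shape=(7, 24))
cash_to_card = counts[0] / counts[1]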

[Figure: The fraction of cash to card payments for a given time and day of the week]

Looking at the plot above, we can notice a pattern similar to the one that showed the tip percentage as a function of the day of the week and time of day. From these two plots, the data suggests that passengers who pay by card tend to tip more than those who pay with cash. Is this really the case? I invite you to figure it out for yourself, since you now have the knowledge, the tools, and the data! You can also check out this notebook for some additional hints.

Getting there

I hope this article was a useful introduction to Vaex and that it will help alleviate some of the "uncomfortable data" issues you may be facing, at least when it comes to tabular datasets. If you are interested in the dataset used in this article, you can use it directly from S3 with Vaex. Check out the full Jupyter notebook to learn how to do this.

With Vaex, you can go through over a billion rows of data, compute all kinds of statistics and aggregations, and produce informative plots in mere seconds, right from your own laptop. It is free and open source, and I hope you will give it a chance!

via: https://towardsdatascience.com/how-to-analyse-100s-of-gbs-of-data-on-your-laptop-with-python-f83363dda94
