Machine Heart report. Project author: CLUEbenchmark. Participants: Si, Du Wei. With this open-source project, you no longer have to worry about finding a useful Chinese NLP data set. With 142 data sets, there is always one that suits you.

2024/05/27 13:22:32

Machine Heart report

Project author: CLUEbenchmark

Participants: Si, Du Wei

With this open-source project, you no longer have to worry about finding a useful Chinese NLP data set. With 142 data sets, there is always one that suits you.


Chinese NLP data set search: https://www.cluebenchmarks.com/dataSet_search.html

Anyone learning NLP soon discovers that most state-of-the-art algorithms and high-quality sample code rely on English data sets. When we hopefully try to carry a model over to Chinese, the lack of public, high-quality data sets becomes a real barrier. Even the simplest language models and word embedding models, which need nothing more than passages of natural Chinese text, are hard to train because useful, public, large-scale corpora are genuinely scarce.

We usually have to hunt through GitHub and other platforms for projects that collect Chinese NLP data sets, and then choose according to our needs. Note that many Chinese data sets are quite old and more troublesome to use, so finding a suitable one takes judgment and some trial and error.

In this article, we introduce a new Chinese NLP data search project, which may be the most comprehensive collection of Chinese NLP data set information. The project gathers information on more than one hundred Chinese NLP data sets and presents it through a search interface. We only need to type in a keyword, or details such as the field the data set belongs to, to find matching data sets.


Each search result shows the data set's basic information, access link, and other key details, which helps us filter data sets quickly. Because every field contains many similar data sets, these brief overviews are very useful.


Readers who want to browse everything that is available can go directly to the search project's GitHub repository, which lists the information for all data sets.

This may be the most comprehensive collection of Chinese NLP data sets

The NLP data sets in this project cover NER, QA, sentiment analysis, text classification, text matching, text summarization, machine translation, knowledge graphs, corpora, and reading comprehension: 142 data sets in 10 categories.

Specifically, for each data set, the project author provides information such as the data set name, update time, data set provider, description, keywords, category, and paper address.
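The metadata fields above, together with the keyword and category search described earlier, can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `DatasetRecord` class, its attribute names, the `search` helper, and the two sample entries are all hypothetical and are not taken from the project's actual code or data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    # One attribute per metadata field listed in the article.
    # The attribute names themselves are hypothetical.
    name: str
    updated: str
    provider: str
    description: str
    keywords: List[str] = field(default_factory=list)
    category: str = ""
    paper_url: str = ""


def search(records: List[DatasetRecord],
           query: str = "",
           category: Optional[str] = None) -> List[DatasetRecord]:
    """Return records whose name, description, or keywords contain `query`
    (case-insensitive), optionally restricted to a single category."""
    q = query.strip().lower()
    hits = []
    for r in records:
        if category is not None and r.category.lower() != category.lower():
            continue
        haystack = " ".join([r.name, r.description, *r.keywords]).lower()
        if q in haystack:  # an empty query matches every record
            hits.append(r)
    return hits


# Placeholder entries for illustration only; not copied from the project.
records = [
    DatasetRecord("example-sentiment", "2020", "example provider",
                  "product review polarity", ["sentiment", "reviews"],
                  "sentiment analysis"),
    DatasetRecord("example-news", "2019", "example provider",
                  "news topic labels", ["news", "topics"],
                  "text classification"),
]

print([r.name for r in search(records, "news")])
print([r.name for r in search(records, category="Sentiment Analysis")])
```

A plain substring match is used here for simplicity; the real search page may rank or tokenize queries differently.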

Project address: https://github.com/CLUEbenchmark/CLUEDatasetSearch


The project's classification of Chinese NLP data sets.

Since the project covers many types of data sets, Machine Heart only briefly introduces the sentiment analysis and text classification data sets here.

Sentiment analysis

As a common application of natural language processing (NLP), sentiment analysis lends itself to classification methods that aim to extract the emotional content of text. This project lists 11 sentiment analysis data set sources, including NLPCC 2013/2014, the Weibo Emotions Corpus, the Zhijiang Cup E-commerce Review Opinion Mining Competition, and the 2019 Sohu Campus Algorithm Competition data set.

Details of some Chinese sentiment analysis data sets in the project.

Text classification

Text classification is the most common and most basic application in natural language processing, so many data sets already exist for it. This project lists 19 text classification data set sources, including Toutiao Chinese news (text) classification, THUCNews Chinese text classification, the 2017 Zhihu Kanshan Cup machine learning challenge, and the University of Science and Technology of China news classification corpus.

Details of some text classification data sets in the project.

Finally, developers can also contribute by uploading data set information. After uploading information for 5 or more data sets and passing review, you can become a contributor to the project. The current 142 data sets already look fairly complete, but covering more NLP subfield tasks will require everyone to help maintain the project.
