A Selection of Datasets for Machine Learning

Hi guys,

This article is a guide to open datasets for machine learning. To start, I have collected a selection of interesting and relatively fresh datasets. As a bonus, at the end of the article you will find useful links for searching for datasets on your own.

Fewer words, more data.


A selection of datasets for machine learning:

  • Game of Thrones deaths and battles data — this dataset combines three data sources, each based on information from the book series.
  • Global Terrorism Database — Over 180,000 terrorist attacks worldwide, 1970-2017.
  • Bitcoin, historical data — Bitcoin data at 1-minute intervals from selected exchanges, January 2012 to March 2019.
  • FIFA 19 complete player dataset — 18k+ FIFA 19 players, ~90 attributes, extracted from the latest FIFA database.
  • YouTube video statistics — daily statistics of trending videos on YouTube.
  • Survey of suicide rates from 1985 to 2016 — Comparison of socio-economic information with suicide rates by year and country.
  • Huge stock market data set — historical daily prices and volumes of all US stocks and ETFs.
  • World Development Indicators — development indicators of countries from around the world.
  • Kaggle Machine Learning & Data Science Survey 2017 — Great insight into the state of data science and machine learning.
  • Gun violence data — a comprehensive record of more than 260,000 US gun violence incidents in 2013-2018.
  • Chest X-ray (pneumonia) — 5,863 images, 2 categories.
  • Gender recognition by voice — this database was created to identify a voice as male or female, based on the acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples collected from male and female speakers.
  • Student alcohol consumption — data was obtained in a survey of students in mathematics and Portuguese language courses in high school. It contains a lot of interesting social, gender and educational information about students.
  • Malaria Cell Dataset — cellular images to detect malaria.
  • Surveys of young people — data on the preferences, interests, habits, opinions and fears of young people.
  • World University Rankings — explore the best universities in the world.
  • Credit Card Fraud Detection — anonymized credit card transactions labeled as fraudulent or genuine.
  • Heart disease dataset — this database contains 76 attributes, such as age, sex, chest pain type, resting blood pressure and others.
  • European football database — 25,000+ matches, with player and team attributes for European professional football.
  • Wine Reviews — 130k wine reviews with variety, location, winery, price and description.
  • Baidu Apolloscapes. A large dataset for recognizing 26 semantically different objects such as cars, bicycles, pedestrians, buildings, street lamps, etc.
  • Comma.ai. More than seven hours of highway driving. The dataset includes information about the vehicle's speed, acceleration, steering angle and GPS coordinates.
  • Flower recognition — this dataset contains 4,242 images of flowers. The data was collected from Flickr, Google Images and Yandex Images.
  • Daily market price of each cryptocurrency — historical cryptocurrency prices for all tokens.
  • Chocolate rating — Expert rating of more than 1,700 chocolate bars.
  • Medical insurance market — data on health and dental plans for the US health insurance market.
  • Heartbeat sounds — classification of heartbeat anomalies from stethoscope audio.
  • Anime Recommendations Database — recommendations from 76,000 users on myanimelist.net
  • Blood cell images — 12,500 images: 4 different types of cells.
  • Chest x-ray — over 112,000 chest radiographs from over 30,000 unique patients.
  • Homicide reports, 1980-2014 — the Murder Accountability Project is the most comprehensive homicide database in the United States currently available.
  • Used car database — over 370,000 used cars. The data is in German, so you may need to translate it first if you do not read German.
  • The home of the US Government's open data — data, tools and resources for conducting research, developing web and mobile applications, and designing data visualizations.
  • National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP). The center works to reduce the risk factors for chronic diseases.
  • The UK's largest collection of social, economic and demographic resources.
  • EconData — several thousand economic time series, produced by a number of US government agencies and distributed in a variety of formats and media.
  • Coast Research Center — interesting data on the sea and its biological composition. Here you can find datasets ranging from analysis of Red Sea model data to studies of temperature and currents over the narrow southern California shelf.
  • Sign Language Digits Dataset — a sign language digits dataset prepared by students of Turkey Ankara Ayrancı Anadolu High School.
  • Red wine quality — a simple and clean practice dataset for regression or classification modeling.
  • League tables of the English football Premier League (1968-2019).
  • HotpotQA Dataset — a dataset of questions and answers for building question answering systems that are more explainable.
  • xView — one of the largest publicly available datasets of overhead imagery of the Earth. It contains images of various scenes from around the world, annotated with bounding boxes.
  • LabelMe — a large dataset of annotated images.
  • ImageNet — an image dataset for new algorithms, organized according to the WordNet hierarchy, in which each node of the hierarchy is represented by hundreds or thousands of images.
  • LSUN — image datasets divided into scenes and categories, with partially labeled data.
  • MS COCO — large-scale dataset for detection and segmentation of objects.
  • COIL-100 — 100 different objects imaged at every angle of a full 360-degree rotation.
  • Visual Genome — a dataset of ~100 thousand images with detailed annotations.
  • Google’s Open Images — a collection of 9 million URLs to images “tagged with more than 6,000 categories” under a Creative Commons license.
  • Labeled Faces in the Wild — a set of 13,000 labeled face images for use in applications that involve face recognition technology.
  • Stanford Dogs Dataset — contains 20,580 images of 120 dog breeds.
  • Indoor Scene Recognition — a dataset for recognizing indoor scenes. Contains 15,620 images in 67 categories.
  • Oxford RobotCar — more than 100 repetitions of one route through Oxford, recorded over the course of a year. The dataset captures various combinations of weather, traffic and pedestrians, as well as longer-term changes such as road works.
  • Cityscapes Dataset — a large dataset containing recordings of a hundred street scenes in 50 cities.
  • KUL Belgium Traffic Sign Dataset — over 10,000 annotations of thousands of different traffic signs in Belgium.
  • LISA (Laboratory for Intelligent & Safe Automobiles) — datasets with traffic signs, traffic lights, detected vehicles and trajectories.
  • Bosch Small Traffic Light Dataset — a dataset with 24,000 annotated traffic lights.
  • WPI datasets — Dataset for recognition of traffic lights, pedestrians and road markings.
  • Berkeley DeepDrive — a huge dataset for self-driving systems. It contains over 100,000 videos with more than 1,100 hours of driving recorded at different times of day and in different weather conditions.
  • MIMIC-III — a dataset with de-identified health data for ~40,000 patients in intensive care (demographic data, vital signs, laboratory tests and medications).
  • Amazon Reviews — about 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings and the text of the review itself.
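
Several of the tabular datasets above (the credit card fraud one, for instance) come with a label column, and a common first step is to check how balanced the labels are. Below is a minimal sketch of that check using only the Python standard library. The in-memory sample and the column names `amount` and `label` are made up for illustration and are not taken from any of the listed datasets.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded CSV file; the column names
# ("amount", "label") are assumptions, not taken from any real dataset.
SAMPLE_CSV = """amount,label
12.50,genuine
830.00,fraud
4.99,genuine
61.20,genuine
"""


def label_counts(csv_file, label_column):
    """Count how often each value appears in the label column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


def label_shares(counts):
    """Convert raw counts into fractions of the total number of rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


counts = label_counts(io.StringIO(SAMPLE_CSV), "label")
shares = label_shares(counts)
print(counts)
print(shares)
```

For a real file, the same function should work with a handle opened via `open(path, newline="")`, provided you pass the actual name of the label column in that file.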

Useful links for searching datasets:

  • Kaggle, of course — the meeting place for all fans of machine learning competitions.
  • Google Dataset Search — search for datasets across the whole Internet. If needed, you can also add your own datasets.
  • Machine Learning Repository — a set of databases, domain theories and data generators that are used by the machine learning community for empirical analysis of machine learning algorithms.
  • VisualData — dataset search for computer vision, with convenient classification by category.
  • DATA USA — a comprehensive collection of publicly available US data with visualizations, descriptions and infographics.

That brings our short selection to an end. If you have something to add or share, write in the comments.
