The Most Complete List of Big Data Sources (Datasets) for Download (Continuously Updated)
- Agriculture
- Biology
- Climate/Weather
- Complex Networks
- Computer Networks
- Data Challenges
- Earth Science
- Economics
- Education
- Energy
- Finance
- GIS
- Government
- Healthcare
- Image Processing
- Machine Learning
- Museums
- Natural Language
- Neuroscience
- Physics
- Psychology/Cognition
- Public Domains
- Search Engines
- Social Networks
- Social Sciences
- Software
- Sports
- Time Series
- Transportation
- Complementary Collections
- 1000 Genomes
- American Gut (Microbiome Project)
- Broad Bioimage Benchmark Collection (BBBC)
- Broad Cancer Cell Line Encyclopedia (CCLE)
- Cell Image Library
- Complete Genomics Public Data
- EBI ArrayExpress
- EBI Protein Data Bank in Europe
- Electron Microscopy Pilot Image Archive (EMPIAR)
- ENCODE project
- Ensembl Genomes
- Gene Expression Omnibus (GEO)
- Gene Ontology (GO)
- Global Biotic Interactions (GloBI)
- Harvard Medical School (HMS) LINCS Project
- Human Genome Diversity Project
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- International HapMap Project
- Journal of Cell Biology DataViewer
- MIT Cancer Genomics Data
- NCBI Proteins
- NCBI Taxonomy
- NCI Genomic Data Commons
- NIH Microarray data or FTP (see FTP link on RAW)
- OpenSNP genotypes data
- Pathguide – Protein-Protein Interactions Catalog
- Protein Data Bank
- Psychiatric Genomics Consortium
- PubChem Project
- PubGene (now Coremine Medical)
- Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
- Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
- Sequence Read Archive (SRA)
- Stanford Microarray Data
- Stowers Institute Original Data Repository
- Systems Science of Biological Dynamics (SSBD) Database
- The Cancer Genome Atlas (TCGA), available via Broad GDAC
- The Catalogue of Life
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
- Universal Protein Resource (UniProt)
- Actuaries Climate Index
- Australian Weather
- Aviation Weather Center – Consistent, timely and accurate weather information for the world airspace system
- Brazilian Weather – Historical data (In Portuguese)
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- European Climate Assessment & Dataset
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- NOAA SURFRAD Meteorology and Radiation Datasets
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WorldClim – Global Climate Data
- WU Historical Weather Worldwide
- AMiner Citation Network Dataset
- CrossRef DOI URLs
- DBLP Citation dataset
- DIMACS Road Networks Collection
- NBER Patent Citations
- Network Repository with Interactive Exploratory Analysis Tools
- NIST complex networks data collection
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Small Network Data
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- Stanford Longitudinal Network Data Sources
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- ClueWeb09 – 1B web pages
- ClueWeb12 – 733M web pages
- CommonCrawl Web Data over 7 years
- CRAWDAD Wireless datasets from Dartmouth Univ.
- Criteo click-through data
- OONI: Open Observatory of Network Interference – Internet censorship data
- Open Mobile Data by MobiPerf
- Rapid7 Sonar Internet Scans
- UCSD Network Telescope, IPv4 /8 net
- Bruteforce Database
- Challenges in Machine Learning
- CrowdANALYTIX dataX
- D4D Challenge of Orange
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- TravisTorrent Dataset – MSR’2017 Mining Challenge
- Yelp Dataset Challenge
- AQUASTAT – Global water resources and uses
- BODC – marine data of ~22K vars
- Earth Models
- EOSDIS – NASA’s earth observing system data
- Integrated Marine Observing System (IMOS) – roughly 30TB of ocean measurements or on S3
- Marinexplore – Open Oceanographic Data
- Smithsonian Institution Global Volcano and Eruption Database
- USGS Earthquake Archives
- American Economic Association (AEA)
- EconData from UMD
- Economic Freedom of the World Data
- Historical MacroEconomic Statistics
- International Economics Database and various data tools
- International Trade Statistics
- Internet Product Code Database
- Joint External Debt Data Hub
- Jon Haveman International Trade Data Links
- OpenCorporates Database of Companies in the World
- Our World in Data
- SciencesPo World Trade Gravity Datasets
- The Atlas of Economic Complexity
- The Center for International Data
- The Observatory of Economic Complexity
- UN Commodity Trade Statistics
- UN Human Development Reports
- AMPds
- BLUEd
- COMBED
- Dataport
- DRED
- ECO
- EIA
- HES – Household Electricity Study, UK
- HFED
- iAWE
- PLAID – the Plug Load Appliance Identification Dataset
- REDD
- Tracebase
- UK-DALE – UK Domestic Appliance-Level Electricity
- WHITED
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- NYSE Market Data (see FTP link on RAW)
- OANDA
- OSU Financial data
- Quandl
- St. Louis Fed (Federal Reserve Economic Data)
- Yahoo Finance
- ArcGIS Open Data portal
- Cambridge, MA, US, GIS data on GitHub
- Factual Global Location Data
- Geo Spatial Data from ASU
- Geo Wiki Project – Citizen-driven Environmental Monitoring
- GeoFabrik – OSM data extracted to a variety of formats and areas
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Homeland Infrastructure Foundation-Level Data
- Landsat 8 on AWS
- List of all countries in all languages
- National Weather Service GIS Data Portal
- Natural Earth – vectors and rasters of the world
- OpenAddresses
- OpenStreetMap (OSM)
- Pleiades – Gazetteer and graph of ancient places
- Reverse Geocoder using OSM data & additional high-resolution data files
- TIGER/Line – U.S. boundaries and roads
- TwoFishes – Foursquare’s coarse geocoder
- TZ Timezones shapefiles
- UN Environmental Data
- World boundaries from the U.S. Department of State
- World countries in multiple formats
- A list of cities and countries contributed by community
- Open Data for Africa
- OpenDataSoft’s list of 1,600 open data portals
- EHDP Large Health Data Sets
- Gapminder World demographic databases
- GDC supports several cancer genome programs for CCG, TCGA, TARGET etc.
- PhysioBank Databases – a large and growing archive of physiological data
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- MeSH, the vocabulary thesaurus used for indexing articles for PubMed
- Number of Ebola Cases and Deaths in Affected Countries (2014)
- Open-ODS (structure of the UK NHS)
- OpenPaymentsData, Healthcare financial relationship data
- The Cancer Genome Atlas project (TCGA) (refer to GDC and BigQuery table)
- World Health Organization Global Health Observatory
- 10k US Adult Faces Database
- 2GB of Photos of Cats or Archive version
- Adience Unfiltered faces for gender and age classification
- Affective Image Classification
- Animals with attributes
- Caltech Pedestrian Detection Benchmark
- Chars74K dataset, Character Recognition in Natural Images (both English and Kannada are available)
- Face Recognition Benchmark
- Flickr: 32 Class Brand Logos
- GDXray: X-ray images for X-ray testing and Computer Vision
- ImageNet (in WordNet hierarchy)
- Indoor Scene Recognition
- International Affective Picture System, UFL
- Massive Visual Memory Stimuli, MIT
- MNIST database of handwritten digits, 70,000 examples (60,000 training, 10,000 test)
- Several Shape-from-Silhouette Datasets
- Stanford Dogs Dataset
- SUN database, MIT
- The Action Similarity Labeling (ASLAN) Challenge
- The Oxford-IIIT Pet Dataset
- Violent-Flows – Crowd Violence Non-violence Database and benchmark
- Visual genome
- YouTube Faces Database
- Context-aware data sets from five domains
- Delve Datasets for classification and regression (Univ. of Toronto)
- Discogs Monthly Data
- eBay Online Auctions (2012)
- IMDb Database
- Keel Repository for classification, regression and time series
- Labeled Faces in the Wild (LFW)
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Free Music Archive
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- New Yorker caption contest ratings
- RDataMining – “R and Data Mining” ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
- YouTube-8M
- Canada Science and Technology Museums Corporation’s Open Data
- Cooper-Hewitt’s Collection Database
- Minneapolis Institute of Arts metadata
- Natural History Museum (London) Data Portal
- Rijksmuseum Historical Art Collection
- Tate Collection metadata
- The Getty vocabularies
- POS/NER/Chunk annotated data
- Automatic Keyphrase Extraction
- Blogger Corpus
- CLiPS Stylometry Investigation Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia – 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Freebase.com of people, places, and things
- Google Books Ngrams (2.2TB)
- Google MC-AFP, generated based on the public available Gigaword dataset using Paragraph Vectors
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Machine Comprehension Test (MCTest) of text from Microsoft Research
- Machine Translation of European languages
- Making Sense of Microposts 2013 – Concept Extraction
- Making Sense of Microposts 2016 – Named Entity rEcognition and Linking
- Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
- Multi-Domain Sentiment Dataset (version 2.0)
- Open Multilingual Wordnet
- Personae Corpus
- SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
- SMS Spam Collection in English
- Universal Dependencies
- USENET postings corpus of 2005~2011
- Webhose – News/Blogs in multiple languages
- Wikidata – Wikipedia databases
- Wikipedia Links data – 40 Million Entities in Context
- WordNet databases and tools
- Allen Institute Datasets
- Brain Catalogue
- Brainomics
- CodeNeuro Datasets
- Collaborative Research in Computational Neuroscience (CRCNS)
- FCP-INDI
- Human Connectome Project
- NDAR
- NeuroData
- Neuroelectro
- NIMH Data Archive
- OASIS
- OpenfMRI
- Study Forrest
- CERN Open Data Portal
- Crystallography Open Database
- NASA Exoplanet Archive
- NSSDC (NASA) data of 550 spacecraft
- Sloan Digital Sky Survey (SDSS) – Mapping the Universe
- Amazon
- Archive-it from Internet Archive
- Archive.org Datasets
- CMU JASA data archive
- CMU StatLib collections
- Data.World
- Data360
- Infochimps
- KDNuggets Data Collections
- Microsoft Azure Data Market Free DataSets
- Microsoft Data Science for Research
- Numbrary
- Open Library Data Dumps
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- StatSci.org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- Academic Torrents of data sharing from UMB
- Datahub.io
- DataMarket (Qlik)
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Institute of Education Sciences
- National Technical Reports Library
- Open Data Certificates (beta)
- OpenDataNetwork – A search engine of all Socrata powered data portals
- Statista.com – statistics and Studies
- Zenodo – An open dependable home for the long-tail of science
- 72 hours #gamergate Twitter Scrape
- Ancestry.com Forum Dataset over 10 years
- Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape
- CMU Enron Email of 150 users
- EDRM Enron EMail of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- Foursquare from UMN/Sarwat (2013)
- GitHub Collaboration Archive
- Google Scholar citation relations
- High-Resolution Contact Networks from Wearable Sensors
- Indie Map: social graph and crawl of top IndieWeb sites
- Mobile Social Networks from UMASS
- Network Twitter Data
- Reddit Comments
- Skytrax’ Air Travel Reviews Dataset
- Social Twitter Data
- SourceForge.net Research Data
- Twitter Data for Online Reputation Management
- Twitter Data for Sentiment Analysis
- Twitter Graph of entire Twitter site
- Twitter Scrape Calufa May 2011
- UNIMI/LAW Social Network Datasets
- Yahoo! Graph and Social Data
- YouTube Video Social Graph in 2007, 2008
- ACLED (Armed Conflict Location & Event Data Project)
- Canadian Legal Information Institute
- Center for Systemic Peace Datasets – Conflict Trends, Polities, State Fragility, etc
- Correlates of War Project
- Cryptome Conspiracy Theory Items
- Datacards
- European Social Survey
- FBI Hate Crime 2013 – aggregated data
- Fragile States Index
- GDELT Global Events Database
- General Social Survey (GSS) since 1972
- German Social Survey
- Global Religious Futures Project
- Humanitarian Data Exchange
- INFORM Index for Risk Management
- Institute for Demographic Studies
- International Networks Archive
- International Social Survey Program ISSP
- International Studies Compendium Project
- James McGuire Cross National Data
- MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
- Minnesota Population Center
- MIT Reality Mining Dataset
- Notre Dame Global Adaptation Index (ND-GAIN)
- Open Crime and Policing Data in England, Wales and Northern Ireland
- Paul Hensel General International Data Page
- PewResearch Internet Survey Project
- PewResearch Society Data Collection
- Political Polarity Data
- StackExchange Data Explorer
- Terrorism Research and Analysis Consortium
- Texas Inmates Executed Since 1984
- Titanic Survival Data Set or on Kaggle
- UCB’s Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UN Civil Society Database
- Universities Worldwide
- UPJOHN for Labor Employment Research
- Uppsala Conflict Data Program
- World Bank Open Data
- WorldPop project – Worldwide human population distributions
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman’s Baseball Database
- Pinhooker: Thoroughbred Bloodstock Sale Data
- Retrosheet Baseball Statistics
- Tennis database of rankings, results, and stats for ATP, WTA, Grand Slams and Match Charting Project
- Databanks International Cross National Time Series Data Archive
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Time Series Data Library (TSDL) from MU
- UC Riverside Time Series Dataset
- Airlines OD Data 1987-2008
- Bay Area Bike Share Data
- Bike Share Systems (BSS) collection
- GeoLife GPS Trajectory from Microsoft Research
- German train system by Deutsche Bahn
- Hubway Million Rides in MA
- Marine Traffic – ship tracks, port calls and more
- Montreal BIXI Bike Share
- NYC Taxi Trip Data 2009-
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Uber trip data April 2014 to September 2014
- Open Traffic collection
- OpenFlights – airport, airline and route data
- Philadelphia Bike Share Stations (JSON)
- Plane Crash Database, since 1920
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Toronto Bike Share Stations (XML file)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
- Data Packaged Core Datasets
- Database of Scientific Code Contributions
- A growing collection of public datasets: CoolDatasets.
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- OpenDataMonitor: An overview of available open data resources in Europe
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- United States Census Data: The United States Census publishes reams of demographic data at the state, city, and even zip code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean and very comprehensive.
- FBI Crime Data: The FBI crime data set is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
- CDC Cause of Death: The Centers for Disease Control and Prevention (CDC) maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
- Medicare Hospital Quality: Medicare maintains a database on complication rates by hospital that provides for interesting comparisons.
- SEER Cancer Incidence: The US government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors.
- Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
- The Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, like GDP and exchange rates.
- IMF Economic Data: If you want a view of international data, you can find it on the IMF website.
- Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One dataset to explore is the weekly returns of the Dow Jones Index.
- Boston Housing Data: The Boston Housing Data Set contains median housing prices in Boston suburbs as well as 13 attributes that contribute to those prices. It’s an excellent set for experimenting with various types of regressions.
- Enron Emails: After the collapse of Enron, a dataset of roughly 500,000 emails with message text and metadata was released. The dataset is now famous and provides an excellent testing ground for text-related analysis. It has the messiness of real-world data.
- Google N-Grams: If you’re interested in truly massive data, the Google n-grams dataset counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
- Sentence Sentiments: Researchers have labeled 3,000 sentences as expressing positive or negative sentiments. If you’re interested in classifying text, this is a great place to start.
- Reddit Comments: Reddit released a dataset of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller dataset to work with, Kaggle has hosted the comments from May 2015 on their site.
- Wikipedia: Wikipedia provides instructions for downloading the text of English language articles.
- Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The dataset lends itself both to categorization techniques (will a given loan default) and to regressions (how much will be paid back on a given loan).
- Walmart: Walmart has released store-level sales data for 98 items across 45 stores. This is an excellent dataset for time series analysis and has interesting seasonal components as well.
- Airbnb: This website offers different datasets related to Airbnb and listings related to different cities.
- Yelp: Yelp releases an academic dataset that contains information for the areas around 30 universities.
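
Several of the entries above (Boston Housing, Lending Club) are suggested as material for regression experiments. As a minimal sketch of what such an experiment looks like, the snippet below fits a one-variable ordinary least squares line using only the Python standard library. The function name `fit_line` and the numbers are hypothetical and made up for illustration; they are not taken from any dataset in this list.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x; zero means every x is the same value.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Sum of cross deviations of x and y.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up illustrative values (NOT taken from the Boston Housing data).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.5, 22.0, 28.5, 33.0]
slope, intercept = fit_line(rooms, price)
print(f"price ~ {slope:.2f} * rooms + {intercept:.2f}")
```

With a real dataset you would read the columns from the downloaded file instead of hard-coding them, and a dataset with many attributes would call for multiple regression rather than this single-predictor form.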
Cross-disciplinary data repositories, data collections and data search engines:
- http://datasource.kapsarc.org
- https://www.kaggle.com/datasets
- http://www.assetmacro.com
- http://usgovxml.com
- http://aws.amazon.com/datasets
- http://databib.org
- http://datacite.org
- http://figshare.com
- http://linkeddata.org
- http://reddit.com/r/datasets
- http://thewebminer.com/
- http://thedatahub.org alias http://ckan.net
- http://quandl.com
- Social Network Analysis Interactive Dataset Library (Social Network Datasets)
- Datasets for Data Mining
- Enigma Public
- http://www.ufindthem.com/
- http://NetworkRepository.com – The First Interactive Network Data Repository
- http://MLvis.com
- Open Data Inception – A Comprehensive List of 2500+ Open Data Portals in the World
- http://data.opendatasoft.com OpenDataSoft catalog
Single datasets and data repositories
- http://archive.ics.uci.edu/ml/
- http://crawdad.org/
- http://data.austintexas.gov
- http://data.cityofchicago.org
- http://data.govloop.com
- http://data.gov.uk/
- data.gov.in
- http://data.medicare.gov
- http://data.seattle.gov
- http://data.sfgov.org
- http://data.sunlightlabs.com
- https://datamarket.azure.com/
- http://developer.yahoo.com/geo/g…
- http://econ.worldbank.org/datasets
- http://en.wikipedia.org/wiki/Wik…
- http://factfinder.census.gov/ser…
- http://ftp.ncbi.nih.gov/
- http://gettingpastgo.socrata.com
- http://googleresearch.blogspot.c…
- http://books.google.com/ngrams/
- http://medihal.archives-ouvertes.fr
- http://public.resource.org/
- http://rechercheisidore.fr
- http://snap.stanford.edu/data/in…
- http://timetric.com/public-data/
- https://wist.echo.nasa.gov/~wist…
- http://www2.jpl.nasa.gov/srtm
- http://www.archives.gov/research…
- http://www.bls.gov/
- http://www.crunchbase.com/
- http://www.dartmouthatlas.org/
- http://www.data.gov/
- http://www.datakc.org
- http://dbpedia.org
- http://www.delicious.com/jbaldwi…
- http://www.faa.gov/data_research/
- http://www.factual.com/
- http://research.stlouisfed.org/f…
- http://www.freebase.com/
- http://www.google.com/publicdata…
- http://www.guardian.co.uk/news/d…
- http://www.infochimps.com
- http://www.kaggle.com/
- http://build.kiva.org/
- http://www.nationalarchives.gov….
- http://www.nyc.gov/html/datamine…
- http://www.ordnancesurvey.co.uk/…
- http://www.philwhln.com/how-to-g…
- http://www.imdb.com/interfaces
- http://imat-relpred.yandex.ru/en…
- http://www.dados.gov.pt/pt/catal…
- http://knoema.com
- http://daten.berlin.de/
- http://www.qunb.com
- http://data.reegle.info/
- http://data.wien.gv.at/
- http://data.gov.bc.ca
- https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
- http://www.icpsr.umich.edu/icpsrweb/CPES/ – Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
- http://www.dati.gov.it
- http://dati.trentino.it
- http://www.databagg.com/
- http://networkrepository.com – Network/ML data repository w/ visual interactive analytics
- UNEP GRID-Geneva (United Nations Environment Programme): a lot of GIS datasets
More than 1 TB
- The 1000 Genomes project makes 260 TB of human genome data available [13]
- The Internet Archive is making an 80 TB web crawl available for research [17]
- The TREC conference made the ClueWeb09 [3] dataset available a few years back. You’ll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
- ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
- CNetS at Indiana University makes a 2.5 TB click dataset available [19]
- ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You’ll have to register (an actual form, not an online form), but it’s free. It’s about 2.1 TB compressed.
- The Yahoo News Feed dataset is 1.5 TB compressed, 13.5 TB uncompressed
- The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.
More than 1 GB
- The Reference Energy Disaggregation Data Set [12] has data on home energy use; it’s about 500 GB compressed.
- The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
- The ImageNet dataset [18] is pretty big.
- The MOBIO dataset [14] is about 135 GB of video and audio data
- The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
- Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
- Yandex has recently made a very large web search click dataset available [1]. You’ll have to register online for the contest to download. It’s about 5.6 GB compressed.
- Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
- The Open American National Corpus [8] is about 4.8 GB uncompressed.
- Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
- The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
- The wiki-links data made available by Google is about 1.75 GB total [20].
Economics
- American Economic Association (AEA): AEAweb: RFE
- UMD: Inforum – EconData
- World Bank: Indicators | Data
Finance
- CBOE Futures Exchange: CFE | Market Data
- Google Finance: Stock market quotes, news, currency conversions & more (R)
- Google Trends: Google Trends – Web Search interest – Worldwide, 2004 – present
- St Louis Fed: Federal Reserve Economic Data (R)
- NASDAQ: NASDAQ – Datastore
- OANDA: Forex Trading | Trade Currency Online | Forex Broker | OANDA (R)
- Quandl: Find, Use and Share Numerical Data
- Yahoo Finance: Yahoo Finance – Business Finance, Stock Market, Quotes, News (R)
Government
- Archived national government statistics: Web Archiving Services for Libraries and Archives
- Australia: 3301.0 – Births, Australia, 2009
- Canada: Home | data.gc.ca
- DataMarket: DataMarket – Find, Understand and Share Data – DataMarket
- Fed Stats: FedStats: Subjects A to Z
- Guardian world governments: Page on guardian.co.uk
- London, U.K. data: Catalogue | London DataStore
- New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by…
- NYC data: NYC Open Data
- OECD: Page on oecd.org
- RITA: RITA | BTS
- San Francisco Data sets: Data | San Francisco
- U.K. Government Data: Data Search | data.gov.uk
- United Nations: UNdata
- U.S. Federal Government Agencies: Federal Agency Participation – Data.gov
- US CDC Public Health datasets: Public-Use Data Files and Documentation
- The World Bank: World Development Report
- UK 2011 Census Open Atlas Project: Page on alex-singleton.com
Health Care
- Gapminder: Data
Machine Learning
- Airlines Data (2009 ASA Challenge): The data. Data expo 09. ASA Statistics Computing and Graphics
- Airports and their locations: Airports and Their Locations
- AppliedPredictiveModeling (R package): Page on bit.ly
- Australian Weather: Daily Weather Observations
- Causality Workbench: Data – Repository – Causality Workbench
- Edge data for US domestic flights 1990 to 2009: US Domestic Flights From 1990 to 2009
- GroupLens Research (movie ratings and more): Datasets
- Kaggle competition data: Go from Big Data to Big Analytics
- KDNuggets competition site: Datasets for Data Mining and Data Science
- The Koblenz Network Collection: The Koblenz Network Collection
- Machine Learning Data Set Repository: mldata :: Welcome
- Medicare Data File: Page on cms.gov
- Microsoft Research: Our research – Microsoft Research
- Million songs: The Million Song Dataset: Giving Back to Music Research
- RDataMining.com (“R and Data Mining” ebook data): Data – RDataMining.com: R and Data Mining
- The Revolution Analytics Collection: Index of /datasets/
- Social Networking: Ancestry.com Forum Dataset
- UCI Machine Learning Repository: UCI Machine Learning Repository
- 53.5 billion clicks: Center for Complex Networks and Systems Research
Public Domain Collections
- Data360: Data360 Homepage
- Datamob: Page on datamob.org
- Factual: Page on factual.com
- Freebase: Freebase
- Google: Google Public Data Explorer
- infochimps: Big Data – Cloud Services
- numbrary: Page on numbrary.com
- Sample R data sets: The R Datasets Package (R)
- SourceForge Research Data: Data
- UFO Reports: National UFO Reporting Center Web Reports
- Wikileaks 911 pager intercepts: 9/11 Pager data
- Resources for AP Statistics, Intro to Statistics, and R | STATS4STEM.ORG: R data sets: Statistical Data Sets, Statistics Data Sets, Data Sets For Statistics, R Datasets (R)
- The Washington Post List: Post Databases (washingtonpost.com)
Science
- Agricultural Experiments: agridat {agridat} (R)
- Climate data: Temperature data (HadCRUT4) and ftp://ftp.cmdl.noaa.gov/
- Gene Expression Omnibus: Home – GEO – NCBI
- Geo Spatial Data: Data | GeoDa Center
- Human Microbiome Project: Microbial Reference Genomes
- MIT Cancer Genomics Data: Page on broadinstitute.org
- NASA: Obtaining Data From the NSSDC
- NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/D… (R)
- Protein structure: PSP benchmark
- Public Gene Data: Browse literature or sequence neighbours
- Stanford Microarray Data: Page on stanford.edu
Social Sciences
- General Social Survey: General Social Survey
- ICPSR: Page on umich.edu
- SNAP: Stanford Large Network Dataset Collection
- UCLA Social Sciences Archive: Data Portals
- UPJOHN INST: Employment Research Data Center
Time Series
- Time Series data Library: Time Series Data Library
Universities
- Carnegie Mellon University Enron email: Enron Email Dataset
- Carnegie Mellon University StatLib: StatLib – Datasets Archive
- Carnegie Mellon University JASA data archive: StatLib—JASA Data Archive
- Ohio State University Financial data: Financial Data Finder
- UC Berkeley: UC DATA :HOME
- UCLA: SOCR Data – Socr
- UC Riverside Time Series: Welcome to the UCR Time Series Classification/Clustering Page
- University of Toronto: Delve Datasets
- Data.gov (USA)
- The World Bank DataBank
- http://www.reddit.com/r/datasets
- A Deep Catalog of Human Genetic Variation (Size: 396.7TB)
- City of Chicago | Data Portal (Size: 9.5GB)
- Google Ngram Viewer (Size: 863.4GB)
- Open Government (Canada)
- Education – Data.gov (Education)
- School of Geographical Sciences & Urban Planning Geo-data