Brokers of consumer data are currently under intense media and governmental scrutiny.[i] As a result, using a search engine to find legitimate data sources has become surprisingly tedious. This page lists some important sources of commercial, government, and open-source data that your organization can use to create or enrich your data models. If you’d like to recommend a data source to add to our list, please drop us a note.
Here are some leading data marketplaces:
Quandl curates over eight million financial, economic, and social datasets.
DataMarket is an open portal to thousands of datasets from diverse providers. It also lets users contribute datasets.
Microsoft’s Windows Azure Marketplace focuses on high-quality data sets and services.
Factual focuses on geospatial, mobile, and consumer-product data and services.
Bluekai.com’s Audience Data Marketplace sells “actionable audience data on more than 300 million users, . . . over 80% of the entire U.S. Internet population.”
Bluekai’s Data Activation System lets you monetize your data sets.
Handshake lets individuals market their personal data.
Brilig is a “cooperative data marketplace for online advertising that enables buyers and sellers to trade reach and relevance.”
There are many commercial data brokers. Here is a partial list:
Arcametrics sells tailored prospect lists.
Archives sells genealogical data.
Experian sells a variety of business profiling data.
FICO sells several commercial scoring services, such as bankruptcy and fraud scores.
i360 sells voter and consumer data.
iBehavior sells subscriptions to 12 billion transaction records, as well as records on 30 million business contacts and 190 million individuals.
IHS sells a wide variety of raw data and analytical/forecasting results for many industries, such as automotive, defense, energy, and maritime.
IXI sells financial, economic, behavioral, and demographic consumer data, as well as geocoding services. The IXI customer can query IXI’s consumer database using financial, economic, demographic, and geographic constraints. IXI also sells bespoke business prospecting lists.
Profound Networks sells profiles of corporate IP networks and network activity. These data sets can be applied in real time to target online advertising to business visitors.
RapLeaf enriches your list of email addresses with additional data and analyzes the list along various dimensions.
Relevate sells business and consumer data sets and data-enrichment services, including realtime services.
Rosslyn Analytics cleanses and enriches business data.
TLO (now owned by TransUnion) maintains the TLOxp database, which lets corporations and law-enforcement agencies investigate individuals, locations, and companies.
V12 Group has a collection of databases covering consumer, business, automotive, personality, and online-audience data, which you can query over many dimensions.
Wikipedia maintains a list of financial-data feeds.
A Google search for ‘historical market data’ returns a good list of market-history data providers.
Google Books has digitized over 10% of the world’s books. You can search Google Books by keyword, plot a phrase’s popularity over time, or use the Google Books API to integrate book search into a text-mining application. Google has also published a database of N-grams drawn from roughly one trillion words of public Web pages.
BYU publishes a list of publicly available corpora and N-grams.
The widely used Reuters-21578 corpus is one of many datasets available at the UC Irvine repository.
Wikileaks freely distributes its entire corpus.
You can search Twitter from R using the twitteR package.
The Enron email corpus is a famous collection frequently used for information-extraction research.
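As a rough illustration of integrating Google Books search into a text-mining application, here is a minimal Python sketch. It assumes the public volumes-search endpoint of the Google Books API and its usual JSON layout (an `items` list whose entries carry a `volumeInfo.title` field). The function names are hypothetical, and the sketch ignores API keys, quotas, paging, and error handling, so check the current API documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed public search endpoint of the Google Books API.
BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(query, max_results=10):
    """Build a keyword-search URL (hypothetical helper, not part of any library)."""
    params = urllib.parse.urlencode({"q": query, "maxResults": max_results})
    return f"{BOOKS_ENDPOINT}?{params}"


def extract_titles(response):
    """Return the title of each volume in a decoded search response.

    Volumes without a title yield an empty string; a response with no
    'items' key (no matches) yields an empty list.
    """
    return [
        item.get("volumeInfo", {}).get("title", "")
        for item in response.get("items", [])
    ]


def search_books(query, max_results=10):
    """Fetch matching volumes over the network and return their titles."""
    with urllib.request.urlopen(build_search_url(query, max_results)) as resp:
        return extract_titles(json.load(resp))
```

Calling `search_books("information extraction")` would return a list of titles that you could feed into a downstream text-mining step. Keeping the URL builder and the response parser separate from the network call makes both easy to test offline.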
There are far too many open source, research, and academic data sets online to catalog here. These data sets are relatively easy to find with your favorite search engine. Here are a few examples:
Amazon.com maintains Public Data Sets that you can easily integrate into AWS applications.
The UC Irvine Machine Learning Repository maintains nearly 300 open-source datasets.
MIT Libraries publishes a list of APIs interfacing to various research datasets.
Indiana University published 53.5 billion Web clicks made by university users.
Pew Research publishes several social science, political science, and news datasets.
The American Psychological Association publishes a list of psychology-research datasets and repositories.
The U.S. Federal Government publishes a great deal of non-classified data online at data.gov.
The World Bank publishes many free data sets online.
Some U.S. states publish data; see for example Colorado’s data.
Facebook’s social graph is the largest social-network dataset in the world. The Facebook Platform gives third parties access to Facebook’s social graph.
Gnip provides “full firehose” access to several other social datasets, including Foursquare, Tumblr, WordPress, and Twitter; plus “managed access” to the APIs of Facebook, YouTube, Instagram, Google+, Flickr, and other online social services.
U.S. government geospatial data is available online.
The U.S. Department of Agriculture maintains its Geospatial Data Gateway of environmental and natural-resource data.
The U.S. Environmental Protection Agency publishes a great deal of scientific data.
Pitney Bowes markets geospatial data.
The North Carolina State University libraries maintain a great list of U.S. geospatial data sources.
The National Weather Service has a suite of weather-forecast products. See for example its Winter Weather Forecasts.
Pitney Bowes markets a Risk Data Suite Weather Bundle.
[i] See for example http://www.enterprisecioforum.com/en/blogs/jdodge/data-brokers-black-eye-big-data and the Committee on Commerce, Science, and Transportation’s report, “A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes” (U.S. Senate, 2013), http://www.commerce.senate.gov/public/?a=Files.Serve&File_id=0d2b3642-6221-4888-a631-08f2f255b577 (visited March 16, 2014).