Data Sources


Brokers of consumer data are currently under intense media and governmental scrutiny.[i] As a result, using a search engine to find legitimate data sources has become surprisingly tedious. This page lists some important sources of commercial, government, and open-source data that your organization can use to create or enrich your data models. If you’d like to recommend a data source to add to our list, please drop us a note.

Data Marketplaces and Directories

Here are some leading data marketplaces:

Quandl curates over eight million financial, economic, and social datasets.

DataMarket is an open portal to thousands of datasets from diverse providers. It also lets users contribute datasets.

Microsoft’s Windows Azure Marketplace focuses on high-quality data sets and services.

Factual focuses on geospatial, mobile, and consumer-product data and services.

One vendor’s Audience Data Marketplace sells “actionable audience data on more than 300 million users, . . . over 80% of the entire U.S. Internet population.”

Bluekai’s Data Activation System lets you monetize your data sets.

Handshake lets individuals market their personal data.

Brilig is a “cooperative data marketplace for online advertising that enables buyers and sellers to trade reach and relevance.”

Data Brokers

There are many commercial data brokers. Here is a partial list:

Acxiom sells several marketing data solutions.

Arcametrics sells tailored prospect lists.

Archives sells genealogical data.

Dun & Bradstreet sells many business-data subscriptions covering finance, operations, marketing, sales, and service.

Equifax sells consumer demographic and credit data, as well as a wide variety of business data sets.

Experian sells a variety of business profiling data.

FICO sells several commercial scoring services, such as bankruptcy and fraud scores.

i360 sells voter and consumer data.

iBehavior sells subscriptions to 12 billion transaction records, as well as data on 30 million business contacts and 190 million individuals.

IHS sells a wide variety of raw data and analytical/forecasting results for many industries, such as automotive, defense, energy, and maritime.

IXI sells financial, economic, behavioral, and demographic consumer data, as well as geocoding services. IXI customers can query its consumer database using financial, economic, demographic, and geographic constraints. IXI also sells bespoke business prospecting lists.

MaxMind sells the GeoIP2 geolocation databases, which identify the location, organization, connection speed, and user type for each IP address.

Pitney Bowes markets a variety of geospatial, postal, consumer, and business data sets.

Profound Networks sells profiles of corporate IP networks and network activity. These data sets can be applied in real time to target online advertising to business visitors.

RapLeaf enriches your list of email addresses with additional data and analyzes the list along various dimensions.

Relevate sells business and consumer data sets and data-enrichment services, including realtime services.

Rosslyn Analytics cleanses and enriches business data.

TLO (now owned by TransUnion) maintains the TLOxp database, which lets corporations and law-enforcement agencies investigate individuals, locations, and companies.

V12 Group has a collection of consumer, business, automotive, personality, and online-audience databases, which you can query along many dimensions.
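
Several of the brokers above sell data keyed by IP address (for example, geolocation or corporate-network profiles). The sketch below shows the general idea of resolving an IP address to a record by longest-prefix match, using only Python's standard library. The network table, the field names, and the `build_index` and `lookup` helpers are hypothetical illustrations; they do not reflect any vendor's actual file format or API.

```python
import ipaddress

# Hypothetical network table using reserved documentation ranges.
# Commercial databases are far larger and use their own formats.
NETWORKS = {
    "192.0.2.0/24": {"organization": "Example Corp", "country": "US"},
    "198.51.100.0/24": {"organization": "Example University", "country": "CA"},
    "198.51.100.128/25": {"organization": "Example University Library", "country": "CA"},
}


def build_index(table):
    """Parse the CIDR strings once and order them most specific first."""
    parsed = [(ipaddress.ip_network(cidr), info) for cidr, info in table.items()]
    return sorted(parsed, key=lambda pair: pair[0].prefixlen, reverse=True)


def lookup(index, ip_string):
    """Return the record for the most specific network containing the IP, or None."""
    address = ipaddress.ip_address(ip_string)
    for network, info in index:
        if address in network:
            return info
    return None


INDEX = build_index(NETWORKS)
print(lookup(INDEX, "198.51.100.200"))
```

A linear scan is adequate for a small table like this one; a production lookup over millions of networks would more likely use a prefix tree or a sorted-range binary search.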

Price Data

Wikipedia maintains a list of financial-data feeds.

A Google search for ‘historical market data’ returns a good list of market-history data providers.

PriceStats publishes daily inflation statistics through State Street Global Exchange.

Text Corpora


Google Books has digitized over 10% of the world’s books. You can search Google Books by keyword, plot a phrase’s popularity over time, or use the Google Books API to integrate book search into a text-mining application. Google has also published a database of N-grams drawn from roughly one trillion words of public Web text.

Google Scholar lets you search numerous research and academic journals, as well as all U.S. federal and state case law. A Python library is available to query Google Scholar and parse its output.

BYU publishes a list of publicly available corpora and N-grams.

The widely used Reuters-21578 corpus is one of many datasets available at the UC Irvine repository.

Wikileaks freely distributes its entire corpus.

One can search Twitter using the R twitteR library.

The Enron email corpus is a famous collection frequently used for information-extraction research.
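
Several of the sources above distribute N-grams, which are counts of every run of N consecutive words in a corpus. As a minimal sketch of how such counts are derived from raw text, the Python below tokenizes a string and tallies its N-grams. The `tokenize` and `ngram_counts` helpers and the sample sentence are illustrative assumptions; published N-gram collections apply their own tokenization and frequency cutoffs.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(text, n):
    """Count every run of n consecutive tokens in the text."""
    tokens = tokenize(text)
    # Zipping n staggered copies of the token list yields each n-token window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


sample = "the dog chased the cat and the dog barked"
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(3))
```

The same function applies unchanged to any of the corpora listed here once their text is loaded as a string, although a very large corpus would need to be processed in streaming fashion rather than held in memory.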

Academic and Open Source

There are far too many open source, research, and academic data sets online to catalog here. These data sets are relatively easy to find with your favorite search engine. Here are a few examples:

Amazon.com maintains Public Data Sets that you can easily integrate into AWS applications.

The UC Irvine Machine Learning Repository maintains nearly 300 open-source datasets.

MIT Libraries publishes a list of APIs interfacing to various research datasets.

Indiana University published 53.5 billion Web clicks made by university users.

Pew Research publishes several social science, political science, and news datasets.

The American Psychological Association publishes a list of psychology-research datasets and repositories.

Government


The U.S. Federal Government publishes a great deal of non-classified data online.

The World Bank publishes many free data sets online.

Some U.S. states also publish data; see, for example, Colorado’s data.

Social Media


Facebook’s social graph is the largest social-network dataset in the world. The Facebook Platform gives third parties access to Facebook’s social graph.

Gnip provides “full firehose” access to several other social datasets, including Foursquare, Tumblr, WordPress, and Twitter; plus “managed access” to the APIs of Facebook, YouTube, Instagram, Google+, Flickr, and other online social services.

Geospatial


U.S. government geospatial data is available online.

The U.S. Department of Agriculture maintains its Geospatial Data Gateway of environmental and natural-resource data.

The U.S. Environmental Protection Agency publishes a great deal of scientific data.

Pitney Bowes markets geospatial data.

The North Carolina State University libraries maintain a great list of U.S. geospatial data sources.

Weather


The National Oceanic and Atmospheric Administration (NOAA) is the world’s largest provider of weather and climate data. For example, NOAA has a National Convective Weather Forecast Product.

The National Center for Atmospheric Research (NCAR) publishes many weather-related data sets. NCAR also publishes convective weather forecasts.

The National Weather Service has a suite of weather-forecast products. See for example its Winter Weather Forecasts.

Pitney Bowes markets a Risk Data Suite Weather Bundle.


[i] See, for example, the Committee on Commerce, Science, and Transportation’s report, “A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes” (U.S. Senate, 2013) (visited March 16, 2014).