How to find data for data science or GIS projects

Written by Chase Thacker

I am an adoptive father of two boys. My wife and I love West Virginia and hope to share that love of the Appalachian region with others. I do my small part by blogging about data science and Appalachian topics. For fun, I like to play hobbyist board games, read tons of books (particularly sci-fi and fantasy), and pretend to know what I am doing in my woodshop.



Appalachia | Data Science | Tools



February 12, 2021

Intro

When starting a personal data science project, the most difficult part is often figuring out where to get your data. Whether you just want test data for learning a new technique or want to investigate a question of your own, you need a good-quality data source.

A lot more free, public data sources exist than you may expect. Check out some of my most commonly used resources below!

Census

The U.S. Census Bureau not only conducts the decennial census, but also releases similar data every year through its American Community Survey. All of that data is freely available on the Census Bureau’s website.

However, I would not suggest using their website. Maybe I just do not know how to navigate it well, but I have struggled to find the data I want since they updated their interface a couple of years ago. Instead, I suggest the NHGIS Data Finder website. On that site, you can specify the geographic area you are interested in and search for the data fields you want. They collate the census data and present it in a way that is much more intuitive.

Pros

Quality data about demographics, education, economy, and more
Downloadable in easy CSV files
The geographic unit on NHGIS Data Finder makes cross-comparisons easy

Cons

Though the NHGIS website is easier than the Census website, both can still be confusing or overwhelming for new users
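
Once you have a CSV download, loading it takes only a few lines. Here is a minimal sketch, assuming pandas is installed. The function name load_census_csv, the column names, and the two sample rows are placeholders I made up for illustration, not real Census values. The one design choice worth noting is that geographic ID columns are read as text so that codes with leading zeros are not silently turned into numbers.

```python
import io

import pandas as pd


def load_census_csv(source, id_columns):
    """Read a census-style CSV, keeping the given ID columns as strings."""
    # Geographic codes often start with zeros; reading them as text
    # keeps those zeros intact.
    return pd.read_csv(source, dtype={col: str for col in id_columns})


# Placeholder rows for illustration only (not real Census data).
sample_csv = io.StringIO(
    "COUNTY_CODE,COUNTY_NAME,TOTAL_POP\n"
    "001,Example County A,1000\n"
    "003,Example County B,2500\n"
)

counties = load_census_csv(sample_csv, id_columns=["COUNTY_CODE"])
print(counties.dtypes)
print(counties)
```

With a real download, you would pass the path to your CSV file in place of sample_csv and list whichever ID columns your extract contains.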

Kaggle

Kaggle is a data science platform best known for its competitions. Beyond the competitions, it is also a great place to find datasets! The dedicated Kaggle datasets page lets you upload your own datasets or download ones from other users, each rated on its usability.

Since Kaggle datasets are sourced from their competitions and from user-uploaded content, they can have some really interesting material. That source is where I got NBA and NFL birthdays for my post analyzing the birthday problem. Because of the great variety of dataset types, you can find good sets either for testing new techniques or analyzing the data to find answers to interesting questions.

Pros

Varied datasets
User ranked
Easy downloads

Cons

Not all datasets are formatted well for easy use
Since they are user generated, the datasets cannot always be trusted. Check the original source.
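
Before trusting a user-uploaded dataset, a quick sanity check can flag obvious problems. This is only a sketch, assuming pandas is installed. The function name summarize_quality and the tiny players table are hypothetical placeholders, not a real Kaggle dataset, and these checks do not replace comparing against the original source.

```python
import pandas as pd


def summarize_quality(df):
    """Return a few quick checks worth running on any downloaded dataset."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    }


# Placeholder data for illustration only.
players = pd.DataFrame(
    {
        "name": ["Player A", "Player B", "Player B", "Player C"],
        "birthday": ["1990-01-15", "1988-07-04", "1988-07-04", None],
    }
)

report = summarize_quality(players)
print(report)
```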

Built-in Packages

Both Python and R have packages with toy datasets that are perfect for learning the basics of any new technique. In Python, import scikit-learn, and then you can access several datasets such as the California housing dataset or the iris dataset. To import them, try commands like these:

from sklearn import datasets
# Downloads the California housing data on first use, then caches it locally
housing_data = datasets.fetch_california_housing(as_frame=True)
# The iris data ships with scikit-learn, so no download is needed
iris_data = datasets.load_iris(as_frame=True)

In R, you can access the same iris dataset as in Python. Try these commands:

library(datasets)
data(iris)

The great thing about these toy datasets is that they are super popular, and many guides exist online for exploring them in depth. You can easily find well-written guides for learning different analytics techniques.
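
As a starting point for exploring one of these in Python, here is a short sketch using the iris data. It assumes scikit-learn and pandas are installed. The species column is something I add for readability and is not part of the dataset itself.

```python
from sklearn import datasets

iris_data = datasets.load_iris(as_frame=True)

# With as_frame=True, .frame is a pandas DataFrame holding the four
# measurement columns plus a numeric "target" column.
iris_df = iris_data.frame
print(iris_df.shape)
print(iris_df.head())

# target_names maps the numeric target codes to species names.
iris_df["species"] = iris_df["target"].map(
    lambda code: str(iris_data.target_names[code])
)

# Average petal length per species, a common first look at this data.
print(iris_df.groupby("species")["petal length (cm)"].mean())
```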

Pros

Easy to access within your current code editor without tons of searching
Tons of free guides for different techniques using these exact datasets

Cons

These datasets are not terribly useful or informative beyond practicing with new tools; they will not answer any research questions

Government Websites

I already discussed the Census Bureau because they are one of the largest and most common sources of government data, but there are many other sources! Often, state governments will have some of the best sources of state- and county-level data.

The difficulty with government data is that it can be of varying quality. Sometimes, it will be in easy-to-use CSV files, but it will just as often be in frustrating formats like Word documents or PDFs. In an upcoming project, I am going to investigate school performance in West Virginia, and I had to copy over and clean up data from multiple PDF sources.

Since this blog is focused on Appalachian data, all of my recommendations at this level are Appalachian sources. Here are some recommendations:

The West Virginia Department of Education has free data available about school finance, performance, attendance, and more.
The West Virginia GIS Clearinghouse has excellent GIS data files (some of which can be used in R and Python too with proper coding), some links to other common data sources within the WV government, and links to similar sites for surrounding states. This site is often the start of my search.

Pros

More niche data than can generally be found elsewhere
Focused on specific topics (at the federal government agency level) or on specific regions (at the state and county level)

Cons

Data can come in a variety of difficult formats
Documentation for a dataset can be hard to find, which makes the data harder to interpret

APIs and Web Scraping

If you have the skills and experience, you can gather some really interesting data through the APIs of online services. APIs have some drawbacks though. Depending on the service, you may have to sign up and be approved for a developer account. You may face rate limits, or you may have limited access to data.

Someday, I will probably write another post listing some of the available APIs in more detail, but here is a short list for now:

Twitter—Often used for tweet sentiment analyses
Spotify—I used this for a project once that I may clean up and post on the blog someday
Lichess—Popular chess site with vast amounts of chess game data

These APIs can give the data to answer some really neat questions if you are willing to learn their particular documentation and implement it in your code.

Web scraping has basically the same benefits and drawbacks as API usage. It can be more difficult and require more skill. It can be subject to anti-scraping efforts or rate limiting by the website being scraped. However, it can produce some really insightful datasets that are inaccessible otherwise.
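
One common way to cope with rate limiting, whether from an API or a scraped site, is to retry with increasing waits. The sketch below shows that idea using only the standard library. The names fetch_with_backoff and flaky_request are hypothetical, and flaky_request merely simulates a service that fails twice, so no real API is called. Check each service's documentation for its actual limits before settling on delays.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait 1x, 2x, 4x, ... the base delay before retrying.
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for a real API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return {"status": "ok"}


# Record the waits instead of actually sleeping so the demo runs instantly.
waits = []
result = fetch_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

In real use, you would pass a function that performs your request and leave sleep at its default so the pauses actually happen.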

Pros

Unique data
Greater access than what is available in downloadable, pre-configured sets

Cons

Can sometimes be difficult to work with
Can experience gating or difficulty signing up

Search engines

If nothing else works, search for it! You never know what exactly is available until you search. Many resources exist that you can hunt down and use.

Pros

Easy and low barrier to entry

Cons

Can be difficult to find what you need
May not be trustworthy data

Simulating data

If none of the options above give you the data you need, you can always simulate it! Obviously, this method will not work for most investigative projects, but simulated data is excellent for learning more about machine learning techniques and processes.

You can simulate clusters of data and then try different clustering algorithms. You can simulate linear data and then try different linear models. You can simulate classification data and then test out classification techniques. Your imagination is the main limit here.
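
Here is a minimal sketch of the first two ideas, assuming NumPy and scikit-learn are installed. The sample sizes, the number of clusters, and the slope and intercept (2.5 and 1.0) are arbitrary values I picked for illustration. Because you chose the true structure yourself, you can check how well each technique recovers it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression

# Simulate three clusters of 2-D points, then see whether k-means finds them.
X_blobs, true_labels = make_blobs(
    n_samples=300, centers=3, cluster_std=0.8, random_state=42
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
predicted_labels = kmeans.fit_predict(X_blobs)

# Simulate linear data, y = 2.5 * x + 1.0 plus random noise,
# then fit a linear model and compare it to the known truth.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, size=200)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print("Estimated slope:", model.coef_[0])
print("Estimated intercept:", model.intercept_)
```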

Pros

Versatile
Easy, if you are familiar with simulation techniques
Great for teaching or demonstrating

Cons

Not helpful for studying most real-world problems (outside of specific simulation studies which I can definitely cover someday!)

Conclusion

As you can see, tons of options exist for finding data to use for personal projects! I suggest checking out some or all (or even a combo of sources) and finding what works best for you.

I am about to start a personal data project involving data from the Census Bureau, the WV Department of Education, and possibly the WV GIS Clearinghouse. Keep an eye on the blog for posts related to that project!
