Intro
When starting a personal data science project, often the most difficult part is figuring out where to obtain your data. Whether you just need some test data for learning a new technique or want to investigate a question of your own, you need a good-quality data source.
A lot more free, public data sources exist than you may expect. Check out some of my most commonly used resources below!
Census
The U.S. Census Bureau not only conducts the decennial census, but they also release similar data every year through the American Community Survey. All of that data is freely available on the Census Bureau's website.
However, I would not suggest using their website. Maybe I just do not know how to navigate it well, but I have struggled to find the data I want since they updated their interface a couple of years ago. Instead, I would suggest the NHGIS Data Finder website. There, you can specify the geographic area you are interested in and search for the data fields you want. They collate the census data and present it in a much more intuitive way.
Pros
- Quality data about demographics, education, economy, and more
- Downloadable in easy CSV files
- Being able to choose the geographic unit on NHGIS Data Finder makes cross-comparisons easy
Cons
- Though the NHGIS website is easier than the Census website, both can still be confusing or overwhelming for new users
Kaggle
Kaggle is a data science platform best known for its competitions, but it is also a great place to find datasets! The dedicated Kaggle Datasets page lets you upload your own sets or download datasets from other users, each rated for usability.
Since Kaggle datasets come from both competitions and user uploads, they include some really interesting material; Kaggle is where I got the NBA and NFL birthdays for my post analyzing the birthday problem. With that much variety, you can find good sets both for testing new techniques and for answering interesting questions.
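If you prefer to script your downloads, the official kaggle Python package can fetch datasets directly. Here is a minimal sketch, assuming you have installed the package and set up an API token; the dataset slug is just a placeholder.

```python
# A minimal sketch of downloading a Kaggle dataset with the official kaggle package.
# Assumes `pip install kaggle` and an API token saved at ~/.kaggle/kaggle.json.
# The dataset slug below is a placeholder; swap in the owner/dataset-name you want.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
api.dataset_download_files("some-owner/some-dataset", path="data", unzip=True)
```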
Pros
- Varied datasets
- User ranked
- Easy downloads
Cons
- Not all datasets are formatted well for easy use
- Since they are user-generated, the datasets cannot always be trusted. Check the original source.
Built-in Packages
Both Python and R have packages with toy datasets that are perfect for learning the basics of any new technique. In Python, import scikit-learn, and then you can access several datasets such as the California housing dataset or the iris dataset. To load them, try commands like these:
```python
from sklearn import datasets

housing_data = datasets.fetch_california_housing(as_frame=True)
iris_data = datasets.load_iris(as_frame=True)
```
In R, you can access the same iris dataset as in Python. Try these commands:
```r
library(datasets)
data(iris)
```
The great thing about these toy datasets is that they are super popular, and many guides exist online for exploring them in depth. You can easily find well-written guides for learning different analytics techniques.
Pros
- Easy to access within your current code editor without tons of searching
- Tons of free guides for different techniques using these exact datasets
Cons
- These datasets are not terribly useful or informative beyond practicing with new tools; they will not answer any research questions
Government Websites
I already discussed the Census Bureau because they are one of the largest and most common sources of government data, but there are many other sources! Often, state governments will have some of the best sources of state- and county-level data.
The difficulty with government data is that it varies widely in quality. Sometimes it comes in easy-to-use CSV files, but just as often it is locked in frustrating formats like Word documents or PDFs. For an upcoming project investigating school performance in West Virginia, I had to copy over and clean up data from multiple PDF sources.
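When the data only exists in PDFs, a library like pdfplumber can sometimes save you from retyping tables by hand. Here is a rough sketch of that approach; the file name and page index are placeholders, and real government PDFs usually need extra cleanup afterward.

```python
# A rough sketch of pulling a table out of a PDF with pdfplumber (one of several
# libraries that can do this). The file name and page index are placeholders.
import pandas as pd
import pdfplumber

with pdfplumber.open("school_report.pdf") as pdf:
    page = pdf.pages[0]          # grab the first page
    rows = page.extract_table()  # returns a list of rows (lists of cell strings)

# Treat the first extracted row as the header; expect plenty of cleanup after this.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())
```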
Since this blog is focused on Appalachian data, all of my recommendations at this level are Appalachian sources:
- The West Virginia Department of Education has free data available about school finance, performance, attendance, and more.
- The West Virginia GIS Clearinghouse has excellent GIS data files (some of which can also be loaded in R and Python with the right packages; see the sketch below), links to other common data sources within the WV government, and links to similar sites for surrounding states. This site is often the start of my search.
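As a quick illustration of loading GIS data in Python, geopandas can read common formats like shapefiles straight into a data frame you can analyze or map. This is just a sketch; the file path is a placeholder for whatever you download from the clearinghouse.

```python
# A minimal sketch of loading a GIS file in Python with geopandas.
# The file path is a placeholder for a file downloaded from the clearinghouse.
import geopandas as gpd
import matplotlib.pyplot as plt

counties = gpd.read_file("wv_counties.shp")  # also reads GeoJSON, geopackages, etc.
print(counties.crs)     # check the coordinate reference system
print(counties.head())  # the attribute table behaves like a pandas DataFrame

counties.plot()         # quick map of the geometries
plt.show()
```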
Pros
- More niche data than can generally be found elsewhere
- Focused on specific topics (at the federal government agency level) or on specific regions (at the state and county level)
Cons
- Data can come in a variety of difficult formats
- Dataset documentation can sometimes be hard to find, which makes interpreting the data difficult
APIs and Web Scraping
If you have the skills and experience, you can gather some really interesting data through the APIs of online services. APIs have some drawbacks, though. Depending on the service, you may have to sign up and be approved for a developer account, you may face rate limits, or you may only have access to a subset of the data.
Someday, I will probably write another post listing some of the available APIs in more detail, but here is a short list for now:
- Twitter—Often used for tweet sentiment analyses
- Spotify—I used this for a project once that I may clean up and post on the blog someday
- Lichess—Popular chess site with vast amounts of chess game data
These APIs can give the data to answer some really neat questions if you are willing to learn their particular documentation and implement it in your code.
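As a rough example, here is what pulling a user's recent games from the Lichess API might look like with the requests library. This is only a sketch of the public games-export endpoint as I understand it; check the Lichess API documentation for current endpoints and parameters, and note that the username is a placeholder.

```python
# A sketch of requesting a user's recent games from the Lichess API.
# The username is a placeholder; check the Lichess API docs for current parameters.
import json

import requests

username = "some_player"  # placeholder username
url = f"https://lichess.org/api/games/user/{username}"

response = requests.get(
    url,
    params={"max": 5},                           # only grab a handful of games
    headers={"Accept": "application/x-ndjson"},  # ask for newline-delimited JSON instead of PGN
    timeout=30,
)
response.raise_for_status()

# Each non-empty line of the response is one game as a JSON object.
games = [json.loads(line) for line in response.text.splitlines() if line]
print(f"Downloaded {len(games)} games")
```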
Web scraping has basically the same benefits and drawbacks as API usage. It can be more difficult and require more skill, and it can run into anti-scraping measures or rate limiting from the website being scraped. However, it can produce some really insightful datasets that are inaccessible otherwise.
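For web scraping in Python, the usual starting point is requests plus BeautifulSoup. Here is a bare-bones sketch that pulls the rows out of an HTML table; the URL is a placeholder, and you should always check a site's terms of service and robots.txt before scraping it.

```python
# A bare-bones scraping sketch: fetch a page and pull text out of an HTML table.
# The URL is a placeholder; respect the target site's terms of service and robots.txt.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/some-table-page", timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")  # first table on the page

rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

print(rows[:3])  # peek at the header and first couple of rows
```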
Pros
- Unique data
- Greater access than what is available in downloadable, pre-configured sets
Cons
- Can sometimes be difficult to work with
- Can experience gating or difficulty signing up
Search Engines
If nothing else works, search for it! You never know what exactly is available until you search. Many resources exist that you can hunt down and use.
Pros
- Easy and low barrier to entry
Cons
- Can be difficult to find what you need
- May not be trustworthy data
Simulating Data
If none of the options above give you the data you need, you can always simulate it! Obviously, this method will not work for most investigative projects, but simulated data is excellent for learning more about machine learning techniques and processes.
You can simulate clusters of data and then try different clustering algorithms. You can simulate linear data and then try different linear models. You can simulate classification data and then test out classification techniques. Your imagination is the main limit here.
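One way to do this in Python is with scikit-learn's built-in generators. Here is a small sketch covering each of the cases above; the parameter values are arbitrary.

```python
# Simulating practice data with scikit-learn's generators; parameter values are arbitrary.
from sklearn.datasets import make_blobs, make_classification, make_regression

# Clusters for trying out clustering algorithms
X_clusters, cluster_labels = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

# Noisy linear data for trying out linear models
X_linear, y_linear = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Labeled data for trying out classification techniques
X_class, y_class = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
```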
Pros
- Versatile
- Easy, if you are familiar with simulation techniques
- Great for teaching or demonstrating
Cons
- Not helpful for studying most real-world problems (outside of specific simulation studies which I can definitely cover someday!)
Conclusion
As you can see, tons of options exist for finding data for personal projects! I suggest checking out some or all of these sources (or combining a few) and seeing what works best for you.
Personally, I am about to start a personal data project involving data from the Census Bureau, the WV Department of Education, and possibly the WV GIS Clearinghouse. Keep an eye on the blog for posts related to that project!