Welcome to the first of my “Quick Hits” entries on the blog! I am still trying to find the proper shape for my blog formats, and I hope this one will persist going forward.
Quick Hits entries will be more focused on a problem I found interesting and a solution to that problem. Other content will usually be written to explore a machine learning or statistics concept in detail, to explain tools for data science, or to examine data sets or recent projects. Unlike that long-form content which could be 5-15 minutes of reading, I aim to keep the Quick Hits in the 2-3 minute range.
Problem
When creating weekly metrics charts, including data from the current week can cause graphs to dive sharply in the current week since data collection for the current week will be incomplete. Though the incomplete data is not problematic for an analyst, end users can be concerned if their number appear to be declining sharply.
The same issue can present itself when looking at daily, monthly, yearly, etc. data if you include the current timeframe. The problem then is finding a way to exclude the current timeframe’s data.
Solution
Luckily, the solution is quite easy! I discovered a method recently when faced with this exact problem at work. My solution relies on the DATE_TRUNC function which exists in Redshift, Postgresql, and several other varieties of SQL. If your variety does not have the same function, you should still be able to find some kind of equivalent (just use whatever method you were using to collate your data into weeks anyway).
When using DATE_TRUNC on a date to truncate to the week, it returns a date object specifying midnight of the first day of the week. Therefore, you can call DATE_TRUNC on CURRENT_DATE to obtain the timestamp of the beginning of your current week.
At that point, filtering is easy. You can filter out anything where the timestamp is greater than the truncated date. Alternatively, you can run the DATE_TRUNC function on both your timestamp and on the current date. Then you can compare the resulting dates and filter anywhere where you a match.
Example
As an example, let’s consider a table recording customer visits (visits). This table has two columns, visit_time and customer_id. If you want a weekly count of visits to use in a line graph or other visualization, you could get the result using this:
select date_trunc(‘week’, visit_time) as week, count(*) as visit_count from visits group by 1
However, that query produces the exact downturning graph we are wanting to avoid! To remove that current week’s data, just modify the query slightly to include this:
select date_trunc(‘week’, visit_time) as week, count(*) as visit_count from visits where visit_time < date_trunc(‘week’, current_date) group by 1
With this small filtering statement, you can cut out all the current week’s data and display only completed weeks in your weekly graph.
Conclusion
Filtering out the current timeframes’s data is ultimately pretty easy once you know one or two ways to do it! By filtering out this data may not give you the most up-to-date information, but it will help ease the fears of product managers and other end users who see a sudden dip in the graphs. Try it out in your workflows, and see if it works well for your work!