Data warehouses are an effective tool for storing and performing big data analytics. Designing a date warehouse is also fun!
I chose BigQuery for generating wildfire related analytics in my fireAlert.app platform (the side project is also documented here). However, I am undecided on the method of streaming/batch loading and transforming the data.
Hands-on comparison of Cloud Functions and Dataflow (I wrote my implementations in Python and Apache Beam/Python, respectively) point me towards using Cloud Functions instead of Dataflow (at least initially).
For my present use case, cloud functions demonstrate the advantage of being simple and inexpensive. Dataflow is more expensive and seems kind of slow to spin-up, but may be more suitable for more complex data transformations. I am not yet ready to make a a conclusive statement on this topic.
I am also now acutely aware of the problem of data duplication in data pipelines. When I batch insert data into Bigquery, I have to be mindful of how I gather the data through Cloud Functions and manage my datasets and tables.
For example, the satellite hotspot data I retrieve from NASA FIRMS, is a 24-hour rolling data source. If I update the data every 15 minutes, I risk batch loading the same data multiple times. One mitigation technique I have considered is maintaining a realtime 24-hr table that overwrites the table during each batch load and once daily appends to separate archive table.
The ultimate goal of designing a data warehouse is to generate valuable and insightful analytics. I hope to build a warehouse that can do just that with wildfire related data. The NASA Firms data is the first of several useful datasets I will aggregate.