When developing an analytic system, sizing is one of those things that it is beneficial to do ‘early and often’. Getting a reasonable estimate early in a project can seem difficult, but there is a simple trick that works for most analytical big data systems: data warehouses, data marts and data lakes.

I mention this in my latest paper on the logical data warehouse, and while discussing it, someone suggested that this simple sizing tip would be a good thing to pass on in a blog. The trick hinges on two insights.

First, in most large data stores a large proportion of the data resides in the ten or twenty largest tables or data sets. A system may have tens or hundreds of tables, but typically the majority of the data is held in just a small subset of them. Second, these large tables can usually be linked to readily available business metrics.

For example, in telecommunications the biggest tables are typically call detail records and network events. In retail they will be point-of-sale basket items and stock movements; in finance they will be financial transactions and the events leading up to them. If we take these main tables and the next eight or nine largest, then we probably have 95% or more of the overall data. This is because these tables typically store the several years of history that we dip into for our analysis.
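To make the arithmetic concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the method from the full article: the table names, daily volumes, row widths and retention periods are all made-up assumptions, and the 95% coverage figure is simply the rough proportion mentioned above. It estimates raw data volume only and ignores indexes, compression and replication.

```python
from dataclasses import dataclass


@dataclass
class TableEstimate:
    """One large table, sized from a business metric (all values hypothetical)."""
    name: str
    events_per_day: int   # business metric, e.g. calls made per day
    bytes_per_row: int    # assumed average width of one stored row
    retention_days: int   # how much history is kept

    def size_bytes(self) -> int:
        # rows per day x bytes per row x days of history
        return self.events_per_day * self.bytes_per_row * self.retention_days


def estimate_total_bytes(tables, coverage=0.95):
    """Scale the largest tables up to a whole-system estimate.

    `coverage` is the assumed share of all data held in the listed tables.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in the range (0, 1]")
    return sum(t.size_bytes() for t in tables) / coverage


# Illustrative figures for a telecommunications example (not real data).
tables = [
    TableEstimate("call_detail_records", 50_000_000, 200, 730),
    TableEstimate("network_events", 200_000_000, 100, 365),
]

total = estimate_total_bytes(tables)
print(f"Estimated raw data volume: {total / 1e12:.1f} TB")
```

In practice you would list all ten or twenty of the largest tables; two are shown here to keep the sketch short. Because every input is a business metric, the estimate is easy to rerun whenever the forecast volumes or retention periods change.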

Read the entire article here: Simple Sizing for Big Data Stores from Business Metrics

via the fine folks at Gartner