Date: 2017-07-24

I was tasked with moving and archiving approximately 16.5 million files from one location, while ingesting around 4,000 new files per hour. This job was for a single client, and my company has multiple customers with similar needs. Keeping the data organized and available has been a challenge for over a year, and we have been working toward a solution for some time now.

Background

My company takes raw data from the medical industry and converts it into various types of measures, allowing customers to gauge their performance and set up strategies for improvement. It offers several other options as well, but these core operations are what cause us to take in so many records.

Each record is tiny – no more than 100 KB, and typically under 50 KB. Individually, the files are not an issue, but in aggregate there are so many that a directory listing cannot complete before the default timeout for such processes is hit. This holds regardless of OS: the ingest point is a Linux server and the archive server is Windows-based, and in both cases the files could not be listed using normal methods.
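To illustrate why a normal listing stalls, and one way around it: most listing tools read and sort every directory entry before printing anything. The sketch below is my own illustration, not the method described in this post. It assumes Python 3 is available on the server, and the names `iter_file_paths` and `iter_batches` are hypothetical. It reads entries lazily with `os.scandir` and hands them out in fixed-size batches, so work can begin before the whole directory has been read.

```python
import os
from itertools import islice
from typing import Iterator, List


def iter_file_paths(directory: str) -> Iterator[str]:
    """Yield regular-file paths one at a time instead of building a full list."""
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; only regular files are yielded.
            if entry.is_file(follow_symlinks=False):
                yield entry.path


def iter_batches(directory: str, batch_size: int = 10_000) -> Iterator[List[str]]:
    """Group the lazy stream of paths into lists of at most batch_size items."""
    paths = iter_file_paths(directory)
    while True:
        batch = list(islice(paths, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be handed to whatever move or archive step is in use. The batch size of 10,000 is an arbitrary placeholder, not a tuned value.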

The Problem

Without a listing, it was difficult to determine a logical way to break the data down into usable chunks. There was no known naming convention to pull from, and even if one existed, there was no way to determine what it might be. We estimated that there were tens of thousands of files per day, but prior to attacking the problem, that was just an educated guess – there was no way to know beforehand.
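One way to turn that educated guess into a measurement, when file names carry no usable pattern, is to bucket files by modification date. This is a sketch under stated assumptions rather than what was actually done here: it assumes each file's modification time roughly reflects its ingest day, and `count_files_per_day` is a hypothetical name. It streams the directory once and keeps only a small per-day tally in memory.

```python
import os
from collections import Counter
from datetime import datetime, timezone


def count_files_per_day(directory: str) -> Counter:
    """Tally regular files by the UTC calendar day of their modification time."""
    counts: Counter = Counter()
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            # Assumption: mtime is a usable proxy for the day the file arrived.
            mtime = entry.stat(follow_symlinks=False).st_mtime
            day = datetime.fromtimestamp(mtime, tz=timezone.utc).date().isoformat()
            counts[day] += 1
    return counts
```

The resulting per-day counts would suggest whether a day is a workable chunk size, or whether a finer split such as per hour is needed.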

