Howto: Splitting Files With Standard Python Scripts
Ready-Made Data Sets That Blow Past the Limits
I am frequently handed raw data for analysis that, once uncompressed, easily comes to half a gigabyte or more per file. From about one gigabyte upwards, desktop statistics tools start to struggle. Most of them do offer options to load only a subset of the columns, or only the first 10,000 rows, and so on.
But what do you do when you want a random sample of the data? You should never rely on the file being in random order: the database export process may already have introduced systematic ordering effects. It may also be the case that you only want to analyse a tenth of some grouping, such as the purchases made by every tenth customer. For that, the complete file has to be read, since otherwise you cannot guarantee that every purchase of the selected customers is taken into account.
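Here is a minimal sketch of that grouped sampling, using only the Python standard library. The file name purchases.csv, the column name customer_id, and the helper keep() are assumptions for illustration; substitute whatever your export actually contains.

```python
import csv
import zlib

SRC = "purchases.csv"          # hypothetical input file
DST = "purchases_sample.csv"   # output containing the sampled rows
KEY = "customer_id"            # hypothetical grouping column
BUCKETS = 10                   # keep roughly one tenth of the customers

def keep(customer_id: str) -> bool:
    # Deterministic hash: the same customer always lands in the same
    # bucket, so all of their purchases are kept or dropped together.
    # (Python's built-in hash() is randomized per run for strings,
    # so a stable checksum like crc32 is used instead.)
    return zlib.crc32(customer_id.encode("utf-8")) % BUCKETS == 0

with open(SRC, newline="", encoding="utf-8") as src, \
     open(DST, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:          # streams the file row by row
        if keep(row[KEY]):
            writer.writerow(row)
```

Because the file is streamed row by row, memory use stays constant no matter how large the input is. Note the difference from a plain row-level random sample (e.g. testing `random.random() < 0.1` inside the loop): that would scatter a single customer's purchases across kept and dropped rows, whereas hashing the customer ID keeps each customer's history intact.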