If you have a large plain text data file that needs to be plotted quickly, it can be useful to thin the data by selecting every nth line, particularly if the dataset is too large to be loaded into memory using numpy. This post is based around this SuperUser question.
If we have a .csv file with a header row followed by rows of data, and want to take every second row and store the result in a new file with the same structure, we can use awk at the unix shell.
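A minimal sketch of such a command follows. The input name DataFile.csv and its sample contents are hypothetical, made up for illustration; only the output name NewDataFile.csv comes from this post.

```shell
# Create a small hypothetical input file: one header row plus five data rows.
printf '%s\n' 'x,y' '1,10' '2,20' '3,30' '4,40' '5,50' > DataFile.csv

# Keep line 1 (the header) and every line whose line number is divisible by 2.
awk 'NR == 1 || NR % 2 == 0' DataFile.csv > NewDataFile.csv

# Show the thinned file: the header plus lines 2, 4 and 6 of the input.
cat NewDataFile.csv
```

With no action given, awk prints each line for which the condition is true, so the filter needs nothing more than the two tests joined by ||.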
awk is a powerful scripting language for manipulating line-based data. It has many features well beyond the scope of this post; see this guide for more details.
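NR holds the number of the line (record) currently being processed, not the total number of rows in the file. Testing NR == 1 therefore keeps the header row of the file (line 1), and NR % 2 == 0 keeps any line whose line number is divisible by 2. NewDataFile.csv would then contain the header row followed by the even-numbered lines of the original file.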
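One way to check which line numbers a filter like this keeps is to feed it numbered lines. In this sketch, seq stands in for a data file and kept_lines.txt is a hypothetical scratch file; neither comes from the original post.

```shell
# seq writes the numbers 1..10, one per line, so line k contains the value k.
# Each value that survives the filter is therefore a line number that was kept.
seq 1 10 | awk 'NR == 1 || NR % 2 == 0' > kept_lines.txt

# Prints 1, 2, 4, 6, 8, 10 (one per line): line 1 plus the even-numbered lines.
cat kept_lines.txt
```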
If you want to change the amount of data thinning, change the value 2 in NR % 2 == 0 to a larger value; for example, NR % 10 == 0 keeps only every tenth line. If your data file has no header row, the NR == 1 condition can be dropped, leaving only NR % 2 == 0.
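As a sketch of the headerless case with an adjustable thinning factor, the step can be passed in as an awk variable. The file names NoHeader.csv and Thinned.csv and the variable name n are assumptions chosen for illustration.

```shell
# Hypothetical headerless input: ten data rows, no header line.
seq 1 10 > NoHeader.csv

# -v n=5 sets the awk variable n; only lines whose number is a multiple of n are kept.
awk -v n=5 'NR % n == 0' NoHeader.csv > Thinned.csv

# Prints lines 5 and 10 of the input.
cat Thinned.csv
```

Setting n=2 reproduces the every-second-line behaviour described above, without editing the awk program itself.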