Question:
I'm comparing two files, which are updated daily, with the diff -y command in order to get two results:
The first is the lines that were changed overnight:
grupoAzul;Gabriel;04-maçãs;02-limões | grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões grupoAzul;Amanda;03-maçãs;05-limões
To do this, I use the command diff -y arquivoAntigo.csv arquivoNovo.csv | grep -e "|"
The second is the new lines:
grupoAzul;Gabriel;04-maçãs;02-limões | grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões grupoAzul;Amanda;03-maçãs;05-limões
> grupoAzul;Kratos;04-maçãs;00-limões
For this result, the command diff -y arquivoAntigo.csv arquivoNovo.csv | grep -e ">" is used.
With that explained, here is the error.
When a new line appears above a modified line, diff 'pushes' the modified line down and treats it as the new line, while the line that is actually new is treated as the modified one.
grupoAzul;Gabriel;04-maçãs;02-limões | grupoAzul;Kratos;04-maçãs;00-limões
> grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões grupoAzul;Amanda;03-maçãs;05-limões
These cases are rare, but when they do happen, more than one line is reported incorrectly.
What causes this bug and how can I fix it?
Answer:
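The problem arises because matching records do not appear at the same position in both files. diff compares the files line by line, in order, and knows nothing about which record a line represents. In the example you showed, line 1 of the file on the left differs from line 1 of the file on the right, so that pair is marked with "|", and line 2 of the file on the right has no counterpart remaining on the left, so it is marked with ">".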
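The mismatch can be reproduced with two small files. This is a minimal sketch, assuming a shell with mktemp available; the file names old.csv and new.csv are hypothetical stand-ins for the real files. The plain diff format is used here because it makes the grouping explicit:

```shell
# Recreate the two files from the question in a scratch directory
# (old.csv and new.csv are hypothetical names).
tmp=$(mktemp -d)
cat > "$tmp/old.csv" <<'EOF'
grupoAzul;Gabriel;04-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões
EOF
cat > "$tmp/new.csv" <<'EOF'
grupoAzul;Kratos;04-maçãs;00-limões
grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões
EOF

# diff exits with status 1 when the files differ, hence the "|| true".
report=$(diff "$tmp/old.csv" "$tmp/new.csv" || true)
# The first line of the report is the header of the first (and only) hunk.
hunk=$(printf '%s\n' "$report" | head -n 1)
printf '%s\n' "$report"
rm -rf "$tmp"
```

The hunk header is 1c1,2: old line 1 is reported as "changed" into new lines 1 to 2. In side-by-side mode the first pair of that hunk gets "|" and the leftover line gets ">", whichever record each line happens to hold.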
To avoid this situation, use sort so that all matching records appear on the same line in both files:
$ diff -y <(sort arquivoAntigo.csv) <(sort arquivoNovo.csv)
<
grupoAzul;Amanda;03-maçãs;05-limões grupoAzul;Amanda;03-maçãs;05-limões
grupoAzul;Gabriel;04-maçãs;02-limões | grupoAzul;Gabriel;05-maçãs;02-limões
> grupoAzul;Kratos;04-maçãs;00-limões
However, as you can see, the blank line in the first file is sorted to the top, so I also suggest removing blank lines with sed:
$ diff -y <(sort arquivoAntigo.csv | sed '/^\s*$/d') <(sort arquivoNovo.csv | sed '/^\s*$/d')
grupoAzul;Amanda;03-maçãs;05-limões grupoAzul;Amanda;03-maçãs;05-limões
grupoAzul;Gabriel;04-maçãs;02-limões | grupoAzul;Gabriel;05-maçãs;02-limões
> grupoAzul;Kratos;04-maçãs;00-limões
The regular expression used in sed ( /^\s*$/ ) matches every line that is empty or contains only whitespace characters, such as spaces and tabs, and the d command deletes those lines.
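To see what the expression removes, here is a small sketch on made-up input: two records separated by an empty line and by a line holding only a space and a tab. Note that \s is a GNU sed extension; the [[:space:]] spelling shown next to it is the POSIX character class and should behave the same way on GNU sed.

```shell
# Sample input: record, empty line, whitespace-only line, record.
input=$(printf 'grupoAzul;Amanda;03-maçãs;05-limões\n\n \t\ngrupoAzul;Gabriel;04-maçãs;02-limões\n')

# Pattern from the answer (\s is a GNU sed extension).
cleaned=$(printf '%s\n' "$input" | sed '/^\s*$/d')

# Same pattern written with the POSIX character class.
portable=$(printf '%s\n' "$input" | sed '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

Only the two record lines are left; both blank lines are dropped.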
By the way, the <( ... ) notation in bash is process substitution: the command enclosed in the parentheses runs in a subshell and its output is made available under a file name. So, when running the diff above, the sort ... | sed ... pipelines are executed and their already processed output is handed to diff as if it were temporary files.
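For shells without <( ... ), roughly the same effect can be had with explicit temporary files. This is a sketch under assumptions: mktemp is available, and old.csv and new.csv are hypothetical stand-ins for arquivoAntigo.csv and arquivoNovo.csv.

```shell
# Hypothetical sample files; the old one contains a blank line.
tmp=$(mktemp -d)
printf 'grupoAzul;Gabriel;04-maçãs;02-limões\n\ngrupoAzul;Amanda;03-maçãs;05-limões\n' > "$tmp/old.csv"
printf 'grupoAzul;Kratos;04-maçãs;00-limões\ngrupoAzul;Gabriel;05-maçãs;02-limões\ngrupoAzul;Amanda;03-maçãs;05-limões\n' > "$tmp/new.csv"

# What <(sort ... | sed ...) does implicitly: prepare each input, then compare.
sort "$tmp/old.csv" | sed '/^[[:space:]]*$/d' > "$tmp/old.sorted"
sort "$tmp/new.csv" | sed '/^[[:space:]]*$/d' > "$tmp/new.sorted"

# diff exits with status 1 when the files differ, hence the "|| true".
side_by_side=$(diff -y "$tmp/old.sorted" "$tmp/new.sorted" || true)
printf '%s\n' "$side_by_side"

# Same filters as in the question: "|" for changed lines, ">" for new ones.
changed=$(printf '%s\n' "$side_by_side" | grep -c '|' || true)
added=$(printf '%s\n' "$side_by_side" | grep -c '>' || true)
rm -rf "$tmp"
```

Process substitution saves the bookkeeping: no intermediate files to name or clean up.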
You can see it working online on tutorialspoint. Since it doesn't seem to be possible to create files there, I had to use variables to "simulate" them: http://tpcg.io/aO9pny
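One case sorting does not cover: if a new record happens to sort immediately before a modified one, diff still pairs the wrong lines. A different technique, comparing by key with awk instead of by position, avoids that. The sketch below assumes, which the question does not state, that the first two fields (group and name) identify a record; old.csv and new.csv are hypothetical names.

```shell
tmp=$(mktemp -d)
cat > "$tmp/old.csv" <<'EOF'
grupoAzul;Gabriel;04-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões
EOF
cat > "$tmp/new.csv" <<'EOF'
grupoAzul;Kratos;04-maçãs;00-limões
grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões
EOF

# Assumption: fields 1 and 2 together form a unique key for each record.
report=$(awk -F';' '
    /^[ \t]*$/ { next }                      # ignore blank lines in both files
    FNR == NR  { old[$1 FS $2] = $0; next }  # first file: index each record by key
    {
        key = $1 FS $2
        if (!(key in old))       print "new: " $0
        else if (old[key] != $0) print "changed: " old[key] " -> " $0
    }
' "$tmp/old.csv" "$tmp/new.csv")
printf '%s\n' "$report"
rm -rf "$tmp"
```

Here the Kratos line is reported as new and the Gabriel line as changed, independent of line order. Records that were removed from the old file are not reported by this sketch.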