archive – How to convert a binary file to ASCII on UNIX?

Question:

I am looking for a certain string in a fairly large file:

$ ls -lh archivo.csv
-rw-rw-rw- 1 yo yo 723M Dec 10 10:46 archivo.csv

If I use grep, the matching line is not printed; grep only reports that the file contains a match:

$ grep "12345" archivo.csv
Binary file archivo.csv matches

So looking at the type of file in question, I see that it is …

$ file archivo.csv
archivo.csv: ISO-8859 text, with very long lines, with CRLF line terminators

I converted it to UNIX line endings with the dos2unix command:

$ dos2unix archivo.csv
dos2unix: converting file archivo.csv to Unix format...

But the problem keeps popping up:

$ grep "12345" archivo.csv
Binary file archivo.csv matches

I then noticed that grep has an option for searching binary files, -a:

$ grep -a "12345" archivo.csv
12345  esto es un test

Well, man grep indicates that:

-a, --text
    Process a binary file as if it were text; 
    this is equivalent to the --binary-files=text option.

But still I wonder, how can I convert this binary file to ASCII?

Answer:

Strictly speaking, all files are binary; when we interpret those bytes according to some encoding X, we say the file is encoded in X.

In your case the file is not binary: it is encoded in ISO-8859, so you must use tools that understand that encoding.

The -a option of grep forces it to treat the file as text, so matching lines are printed even when they contain bytes that are not valid text in your locale's encoding (or NUL bytes, \x0).
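
A rough way to see what grep is objecting to is to count the bytes outside 7-bit ASCII and the NUL bytes. This is only a sketch: sample.txt is a hypothetical stand-in for archivo.csv, created here with printf, and the check uses nothing but tr and wc.

```shell
# Hypothetical sample: "Süd" in ISO-8859-1, where ü is the single byte 0xFC (octal 374).
workdir=$(mktemp -d)
printf 'S\374d\n' > "$workdir/sample.txt"

# Delete every ASCII byte (octal 000-177) and count what is left.
non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$workdir/sample.txt" | wc -c | tr -d ' ')

# Keep only NUL bytes and count them.
nul_bytes=$(LC_ALL=C tr -cd '\000' < "$workdir/sample.txt" | wc -c | tr -d ' ')

echo "non-ASCII bytes: $non_ascii, NUL bytes: $nul_bytes"
```

A non-zero first count together with a zero second count suggests text in an 8-bit encoding rather than truly binary data.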

So you should convert the file to an encoding better suited to your tools. There are many tools for this; the one I like most is iconv, which in your case would be something like:

$ iconv -f ISO-8859-15 -t UTF-8 foo >foo.utf

(NOTE: instead of UTF-8 you could convert to ASCII, as you ask, but then you may lose information present in the original file, such as the § symbol.)
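
As a self-contained sketch of that conversion, the lines below build a tiny hypothetical ISO-8859-1 sample with printf, convert it to UTF-8 and dump the resulting bytes with od. The ASCII variant relies on the //TRANSLIT suffix, which GNU iconv accepts but other implementations may not; what it substitutes for each character depends on the implementation and the locale, so treat that line as an assumption.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: "Süd" in ISO-8859-1 (ü is the byte 0xFC, octal 374).
printf 'S\374d\n' > "$workdir/sample.latin1"

# Convert to UTF-8; ü becomes the two bytes 0xC3 0xBC.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/sample.latin1" > "$workdir/sample.utf8"
utf8_hex=$(od -An -tx1 "$workdir/sample.utf8" | tr -d ' \n')
echo "$utf8_hex"

# Lossy alternative: approximate non-ASCII characters instead of failing.
# Assumption: your iconv supports //TRANSLIT (GNU iconv does).
iconv -f ISO-8859-1 -t ASCII//TRANSLIT "$workdir/sample.latin1" \
    > "$workdir/sample.ascii" || echo "this iconv cannot transliterate" >&2
```

The hex dump makes the difference visible: one byte per character in the Latin-1 input, two bytes for ü in the UTF-8 output.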

For example, taking this file we have

$ file samples7.var
samples7.var: HTML document, ISO-8859 text
$ grep Deut samples7.var
Binary file samples7.var matches
$ grep -a Deut samples7.var
<TITLE>German / Deutsch S▒d (ISO Latin-1 / ISO 8859-1)</TITLE>
<H1>German / Deutsch S▒d (ISO Latin-1 / ISO 8859-1)</H1>
$ iconv -f ISO-8859-15 -t UTF-8 samples7.var > samples7.var.utf
$ file samples7.var.utf
samples7.var.utf: HTML document, UTF-8 Unicode text
$ grep Deut samples7.var.utf
<TITLE>German / Deutsch Süd (ISO Latin-1 / ISO 8859-1)</TITLE>
<H1>German / Deutsch Süd (ISO Latin-1 / ISO 8859-1)</H1>

That, as we can see, allows you to view and filter correctly without losing information.

Finally, dos2unix does not help in this case because it only converts the line terminators (CRLF to LF). It leaves the character encoding untouched, so the non-ASCII bytes that grep objects to are still in the file (see dos2unix).
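
To illustrate why changing the line endings makes no difference to grep, the sketch below strips carriage returns from a hypothetical ISO-8859-1 sample and counts the non-ASCII bytes before and after. It uses tr as a simplified stand-in for dos2unix, which may not be installed; the file names are made up for the example.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: "Süd" in ISO-8859-1 with a CRLF line terminator.
printf 'S\374d\r\n' > "$workdir/crlf.txt"

# Simplified stand-in for dos2unix: remove every carriage return.
tr -d '\r' < "$workdir/crlf.txt" > "$workdir/lf.txt"

# Count bytes outside 7-bit ASCII before and after the conversion.
before=$(LC_ALL=C tr -d '\000-\177' < "$workdir/crlf.txt" | wc -c | tr -d ' ')
after=$(LC_ALL=C tr -d '\000-\177' < "$workdir/lf.txt" | wc -c | tr -d ' ')
echo "non-ASCII bytes before: $before, after: $after"
```

Both counts are the same, which is why grep still reports the file as binary after dos2unix has run.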
