python – Import CSV into Pandas database without converting string to tuple

Question:

I load a CSV file with more than 3 million lines and about 770 Mb , I use pandas and I need to convert a column that is in string format. Below the 'lbBins' column, which when reading from the CSV came in string format (what is the best pattern to save the data in CSV?), and the columns: lnBin1 to lbBin5 resulting from the "reshapeBin" function below.

tempFrame[['lnBins','lnBin1', 'lnBin21, 'lnBin3', 'lnBin4', 'lnBin5']].tail(2)

2445169 (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...   (0, 1, 1, 0, 0) (0, 1, 0, 1, 1) (1, 1, 0, 0, 0) (1, 1, 1, 1, 1) (0, 1, 1, 0, 1)
2445170 (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...   (0, 1, 1, 0, 0) (0, 1, 0, 1, 1) (1, 1, 0, 0, 0) (1, 1, 1, 1, 1) (0, 1, 0, 1, 1)

As you can see in the 'reshapeBin' function I need to perform several functions:

eval()
np.array()
.reshape(5,5)
[num]
.tolist()
tuple()

I use eval() to convert the table row, converting from string to tuple, then convert to array and reshape, get row by row of array in [num], convert to list and then convert to tuple to save to table , in order to save the table in CSV again.

Function, but I think I can improve something else to make processing faster:

def reshapeBin(x, num):
    return tuple(np.array(eval(x)).reshape(5,5)[num].tolist())

for n in range(0,5):
    tempFrame['lnBin'+str(n+1)]=tempFrame['lnBins'].apply(reshapeBin, num=n)
    print('finalizei o ', n)

Probably the way I'm saving from pandas to csv isn't the best option, at least the data format: in tuple table and to csv in string, and vice versa.

Answer:

I couldn't quite understand your problem – at least a whole row of the matrix would help – and understand if you want to optimize the CSV encoding – or just want the final read as well.

In questions of this type it helps to have access to at least one valid sample of the data you want to transform – you only put two lines of text output (summarized by Pandas) – and anyone wanting to help you has to wonder what the input was like: not exactly the kind of thing that helps.

However I'm writing the answer, because you have something glaring there: you make the whole process with the 25 numbers of your matrix repeating itself 5 times in each line, by calling the function "reshapeBin" 5 times.

"eval" is also far from ideal to use. (If people outside your team have access to write to your data, any Python program can be put in the middle of it and will run on your server – but other than that it's not the most performing thing in the world).

Anyway, just considering the few lines you put there, you can avoid having to transform each line the same way 5 times by doing something like this:

...
from ast import literal_eval

tempFrame["data"] = tempFrame["lnBins"].apply(lambda x: np.array(literal_eval(x), dtype="u1")))

for i in range(5):
    tempFrame[f"lnBin{i+1}"] = tempFrame["data"].apply(lambda x: x[i * 5: i *5 + 5])

del tempFrame["data"]

Basically, instead of calling expensive "eval" 5 times for each row, I create a temporary column with the string decoded for numbers (and to preserve memory, instead of tuple, we put an np.array there). Then the same idea of ​​using the 5 applys in a row (note the current way of creating the column name), but each time taking only one slice of the sequence of 25 numbers (no reshape needed)). I also use ast.literal_eval which is less problematic than eval.

Scroll to Top