Python – 2D array filling / text vectorization

Question:

You need to fill a two-dimensional array: the element with index (i, j) must equal the number of occurrences of the j-th word in the i-th sentence. (The sentences are read from a file and split into lists of words; every word found in the sentences is added to a dictionary d, where the key is the word and the value is its ordinal index.)

Input: 22 sentences, each cast to a list of the form ['in', 'comparison', 'to', 'dogs', 'cats', 'have', 'not', 'undergone', 'major', 'changes', 'during', 'the', 'domestication', 'process']. The result should be a 22×253 matrix (22 rows for the sentences, 253 columns for the unique words across all sentences). The words are collected in a dictionary of the form {word: index}. If a word with dictionary index 1 occurs 2 times in sentence 1, the element m[1, 1] should be set to 2, and so on.

I created an empty matrix and iterated over the file, but the matrix stays all zeros, and I don't understand where the error is:

m = np.zeros((number_line, len(new_line)))
i = 0
for line in f.readlines():
    for x in line:
        a = line.count(x)
        j = d[x]
        m[i, j] = a
    i += 1

Answer:

Use sklearn.feature_extraction.text.CountVectorizer and wrap the result in a sparse pandas DataFrame (pd.SparseDataFrame here; note that this class was removed in pandas 1.0).

For large texts this will be orders of magnitude faster than the nested-loop solution and use several orders of magnitude less memory, because the result is stored as a sparse matrix.
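As for why the original loop stays zero, there are two likely causes: if the file was already read once to build d, a second f.readlines() returns an empty list (rewind with f.seek(0) or keep the lines in a list); and `for x in line:` iterates over the *characters* of the string, not its words, so `line.count(x)` counts characters and `d[x]` looks up a character instead of a word. A minimal sketch of the corrected loop, using made-up stand-ins for the question's lines and dictionary:

```python
import numpy as np

# Hypothetical stand-ins for the question's data: a list of already-read
# lines and a word -> column-index dictionary d.
lines = [
    "cats like hot dogs",
    "dogs like cats cats",
]
d = {"cats": 0, "like": 1, "hot": 2, "dogs": 3}

m = np.zeros((len(lines), len(d)))
for i, line in enumerate(lines):
    words = line.split()        # split into words, not characters
    for x in set(words):        # each distinct word once per sentence
        m[i, d[x]] = words.count(x)

print(m)
# row 1 has 2 for "cats" because it occurs twice in the second sentence
```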

Example:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
  "'It's Raining Cats and Dogs",
  "Do cats like dogs or hot dogs?",
  "Cats prefer hot dogs!"
]

cv = CountVectorizer(stop_words='english')

r = pd.SparseDataFrame(cv.fit_transform(sentences),
                       columns=cv.get_feature_names(),
                       default_fill_value=0)

Result:

In [201]: r
Out[201]:
   cats  dogs  hot  like  prefer  raining
0     1     1    0     0       0        1
1     1     2    1     1       0        0
2     1     1    1     0       1        0