I have a large dataset (~1,700,000 observations) that I would like to classify. I also have a not-so-small sample (~8,000 observations) labeled as one of the classes (say, condition A), but I have no labeled examples (zero) of the other classes (say, conditions B to Z). Also, all variables are categorical.
Although there are numerous classes, I am only interested in one of them (condition A, the one for which I have a labeled sample).
Am I able to train the model with only type A observations? If not, how should I overcome this problem?
Is it reasonable to recast the problem as binary classification (type A would be TRUE and all other types FALSE)? In that case, can I randomly take some of the unclassified observations and assume their condition is FALSE? I know that most unclassified observations would be of type B to Z (in the binary FALSE case).
Thanks in advance.
You can turn this into a binary problem, provided your assumption holds that most of the unclassified observations are FALSE, as you say in your question. (Ideally the unclassified set would contain no positives at all, but if their share is very small, it probably won't hurt.)
> I know that most unclassified observations would be of type B to Z (in the binary FALSE case).
In fact, many classifiers do exactly this when they apply the one-vs-rest strategy: one class is treated as positive and everything else as negative.
Following the discussion in the comments, two caveats are worth highlighting:
- If there are condition A observations within your 1.7M pool and your 8,000 sample is not a subsample of that pool, this is probably not the best approach.
- If the number of condition A observations in the 1.7M set is really small, this method, despite being biased, will do better than picking a class at random.
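
To make the relabeling idea concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the toy rows, the helper `build_training_set`, and the small `CategoricalNB` class (a plain naive Bayes for categorical features) are hypothetical stand-ins, and any classifier that handles categorical inputs could take its place.

```python
import math
import random
from collections import Counter, defaultdict


def build_training_set(positives, unlabeled, n_negatives, seed=0):
    """Label known A rows True and a random draw of unlabeled rows False."""
    rng = random.Random(seed)
    # Assumption: most unlabeled rows are not A, so we treat the draw as FALSE.
    assumed_negatives = rng.sample(unlabeled, n_negatives)
    rows = list(positives) + assumed_negatives
    labels = [True] * len(positives) + [False] * len(assumed_negatives)
    return rows, labels


class CategoricalNB:
    """Naive Bayes over categorical features with Laplace smoothing."""

    def fit(self, rows, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        # counts[label][feature_index][value] -> number of occurrences
        self.counts = {c: defaultdict(Counter) for c in self.class_counts}
        self.values = defaultdict(set)  # distinct values seen per feature
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.counts[label][j][value] += 1
                self.values[j].add(value)
        return self

    def log_score(self, row, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for j, value in enumerate(row):
            # The extra +1 in the denominator reserves mass for unseen values.
            num = self.counts[label][j][value] + self.alpha
            den = self.class_counts[label] + self.alpha * (len(self.values[j]) + 1)
            score += math.log(num / den)
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda c: self.log_score(row, c))


# Toy data (made up): 50 labeled A rows and an unlabeled pool of 1,000 rows,
# of which 20 (2%) are actually A-like and may be mislabeled as FALSE.
positives = [("red", "small")] * 40 + [("red", "large")] * 10
unlabeled = (
    [("blue", "large")] * 900
    + [("green", "small")] * 80
    + [("red", "small")] * 20
)

rows, labels = build_training_set(positives, unlabeled, n_negatives=200)
model = CategoricalNB().fit(rows, labels)

print(model.predict(("red", "small")))   # True
print(model.predict(("blue", "large")))  # False
```

In this toy setup a few A-like rows can end up among the assumed negatives, yet the A pattern is still recognized because those rows are a small share of the negatives. The contamination does pull scores toward FALSE, which is the bias mentioned above, and it grows with the fraction of hidden A rows in the pool.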