Question:
I have the following dataframe as an example:
"""
Datetime Product1 Product2.....ProductN
01-01-2019 | 0 | 0 ... 0
01-02-2019 | 0 | 5 ... 0
01-03-2019 | 0 | 6 ... 0
01-04-2019 | 0 | 5 ... 0
01-05-2019 | 0 | 5 ... 0
01-06-2019 | 0 | 5 ... 0
01-07-2019 | 12 | 4 ... 0
01-08-2019 | 0 | 7 ... 0
01-09-2019 | 16 | 0 ... 0
01-10-2019 | 15 | 2 ... 0
01-11-2019 | 18 | 6 ... 0
01-12-2019 | 15 | 7 ... 0
I need to get the value in the "Datetime" column for the first value that is different from 0 (that is, the first sale) and generate a new dataframe with this date and its difference in months from the current date, as shown below (considering we are in January 2020):
Product | first sell | dif
Product1 | 01-07-2019 | 5
Product2 | 01-02-2019 | 11
. | . | .
. | . | .
. | . | .
ProductN | Not launched| Not Launched
Thanks a lot
Answer:
Input data
I'll create an example dataframe to work on:
import pandas as pd
data={
'2019-01-01':[0,0,0],
'2019-02-01':[0,5,0],
'2019-03-01':[0,6,0],
'2019-04-01':[0,5,0],
'2019-05-01':[0,5,0],
'2019-06-01':[0,5,0],
'2019-07-01':[12,4,0],
'2019-08-01':[0,7,0],
'2019-09-01':[16,0,0],
'2019-10-01':[15,2,0],
'2019-11-01':[18,6,0],
'2019-12-01':[15,7,0],
}
df = pd.DataFrame.from_dict(data, orient="index",
columns=["Product{}".format(n) for n in range(1, 4)])
df.index = pd.to_datetime(df.index)
Notice that I've made the date column the index and cast it to type datetime. This is the current dataframe:
Product1 Product2 Product3
2019-01-01 0 0 0
2019-02-01 0 5 0
2019-03-01 0 6 0
2019-04-01 0 5 0
2019-05-01 0 5 0
2019-06-01 0 5 0
2019-07-01 12 4 0
2019-08-01 0 7 0
2019-09-01 16 0 0
2019-10-01 15 2 0
2019-11-01 18 6 0
2019-12-01 15 7 0
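If your dates are still in a regular column, as in the table in the question, something like the following sketch might get you to the same layout. The frame raw and its values are made up for illustration, and the day-month-year format is my assumption based on the sample dates (01-07-2019 being July, not January 7th):

```python
import pandas as pd

# Hypothetical frame shaped like the question's table: the dates are
# day-month-year strings stored in a regular "Datetime" column.
raw = pd.DataFrame({
    "Datetime": ["01-06-2019", "01-07-2019", "01-08-2019"],
    "Product1": [0, 12, 0],
    "Product2": [5, 4, 7],
})

# An explicit format keeps 01-07-2019 from being parsed as January 7th.
raw["Datetime"] = pd.to_datetime(raw["Datetime"], format="%d-%m-%Y")

# Move the dates into the index so every remaining column is a product.
indexed = raw.set_index("Datetime")
print(indexed)
```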
Get the date of the first sale
I select all the cells with a value other than zero (the rest become NaN), compute the cumulative sum of each column, and take the minimum of those cumulative sums, which corresponds to the cell where a number appears for the first time (this relies on sales never being negative). idxmin gives me the index label of that cell, which is the date on which it occurs. Since the result is of type datetime, which also carries a time component that doesn't interest me, I keep only the .date part:
>>> df[df!=0].cumsum().idxmin().dt.date
Product1 2019-07-01
Product2 2019-02-01
Product3 NaT
dtype: object
That is a Series, but I can convert it to a DataFrame and rename the column to "first sell":
result = pd.DataFrame(df[df!=0].cumsum().idxmin().dt.date).rename({0: "first sell"}, axis=1)
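As an alternative sketch (not the approach used above), the first non-zero row can also be located with a boolean mask and idxmax, and products that never sold can be masked out explicitly. This does not depend on sales being non-negative, and it avoids calling idxmin on an all-NaN column, which as far as I know newer pandas releases warn about. The small df and the names sold and first_sale are only for illustration:

```python
import pandas as pd

# Small illustrative frame; Product3 never sells.
df = pd.DataFrame(
    {
        "Product1": [0, 0, 12, 0],
        "Product2": [0, 5, 6, 5],
        "Product3": [0, 0, 0, 0],
    },
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]),
)

# True wherever a sale happened.
sold = df.ne(0)

# idxmax returns the label of the first True in each column. For a column
# with no True at all it returns the first row label, so that case has to
# be masked out separately.
first_sale = sold.idxmax()

# Keep the date only for products with at least one sale; others become NaT.
first_sale = first_sale.where(sold.any())
print(first_sale)
```

Here first_sale stays a datetime64 Series, which makes later date arithmetic simpler than working with plain date objects.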
Get the number of months
To this dataframe I add a column with the number of months elapsed. I approximate this number by taking the number of days between the current date and the date in the "first sell" column and integer-dividing it by 30. Finally I replace the NaT ("not a time") values with the string "Not Launched":
import datetime
result["dif"] = (datetime.date.today() - result["first sell"]).dt.days//30
result = result.fillna("Not Launched")
This is the result:
first sell dif
Product1 2019-07-01 6
Product2 2019-02-01 11
Product3 Not Launched Not Launched
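Dividing by 30 is only an approximation. If you prefer whole calendar months, a sketch along these lines could work instead. The fixed reference date today is an assumption so that the output is reproducible (in real use you would take pd.Timestamp.today()), and the names first_sale, months and launched are illustrative:

```python
import pandas as pd

# First-sale dates as a datetime64 Series; None becomes NaT (never launched).
first_sale = pd.to_datetime(
    pd.Series(
        ["2019-07-01", "2019-02-01", None],
        index=["Product1", "Product2", "Product3"],
    )
)

# Assumed reference date, fixed here so the example is reproducible.
today = pd.Timestamp("2020-01-15")

# Calendar-month difference: count month boundaries, ignore the day of month.
months = (today.year - first_sale.dt.year) * 12 + (today.month - first_sale.dt.month)

# Build the display table, replacing missing entries only at the end.
launched = first_sale.notna()
result = pd.DataFrame({
    "first sell": first_sale.dt.strftime("%Y-%m-%d").where(launched, "Not Launched"),
    "dif": months.astype("Int64").astype(object).where(launched, "Not Launched"),
})
print(result)
```

Note that this counts month boundaries crossed, so a first sale on 2019-12-31 would be 1 month old on 2020-01-01. Whether that is what you want depends on how you define "months since launch".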