## Question:

Good evening. I need to create a neural network that determines how well a sequence matches a reference. For example, we have a reference sequence

```
[0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
```

And we also have input data that may be shifted:

```
[0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
```

At the output, I would like to get a number from 0 to 1: the probability of a match. I would be glad to see any material, links, or better yet pseudo- or real code.

## Answer:

The topic of determining whether two datasets are dependent (put differently, how similar they are) is broad and fairly densely researched. I am a supporter of simple solutions, so before moving on to "pop" classifiers it is better to go through the list of good old purely statistical (or other) methods, assuming all we have are plain arrays. The question does not specify the data type, so I'll go through whatever comes to mind. The nature of the data plays the key role here: the choice of method depends on it.

The very first thing that comes to mind when people talk about the dependence of one variable on another is correlation. You can compute the Pearson correlation coefficient with `numpy.corrcoef` – this function returns the correlation matrix, whose off-diagonal entries are the pairwise coefficients. Correlation is the basic method; there are far more sophisticated and, accordingly, more complex ones. But correlation is a super cool thing. Omnivorous, versatile, as simple as a shovel – it will never let you down. Your best friend and companion in data analysis.
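A minimal sketch using the sequences from the question (assuming they are plain NumPy arrays):

```python
import numpy as np

reference = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0])
candidate = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# element [0, 1] is the Pearson correlation of the two sequences.
r = np.corrcoef(reference, candidate)[0, 1]
print(round(r, 3))  # ≈ 0.633
```

Note that the result is a correlation in [-1, 1], not a probability, so you may want to rescale it before treating it as a "match score".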

Another approach is to find the longest common subsequence in your data. This paragraph is here only because you mentioned that the data can be shifted. In that case we can use a rolling hash or any other method for finding a subsequence. In the general case, though, the "shift" rule will not hold on random data.
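As a much simpler stand-in for a rolling-hash or subsequence search, one can just try every cyclic shift of one sequence against the other and keep the best per-position match rate. This is a hypothetical helper, not a method from the question:

```python
import numpy as np

def best_shift_similarity(a, b):
    """Try every cyclic shift of b and return the best fraction of
    positions that match a (a crude way to handle shifted data)."""
    a, b = np.asarray(a), np.asarray(b)
    best = 0.0
    for shift in range(len(b)):
        score = float(np.mean(a == np.roll(b, shift)))
        best = max(best, score)
    return best

ref = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
inp = [0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
print(round(best_shift_similarity(ref, inp), 3))  # ≈ 0.818
```

For long sequences this brute-force loop is O(n²); that is exactly where rolling hashes or FFT-based cross-correlation start to pay off.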

Another idea is the most ordinary XOR . Your example:

```
[0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
XOR
[0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
=
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

As a result, we have two ones (that is, differing positions) out of 11 – this is exactly how much the two masks differ: the data match by `1 - 2/11 ≈ 82%`. The operation is so fast and cheap that you won't find anything faster. However, again, it is not universal.
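The XOR calculation above in a few lines of NumPy:

```python
import numpy as np

a = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0])
b = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

diff = np.bitwise_xor(a, b)            # 1 wherever the sequences disagree
similarity = 1 - diff.sum() / len(a)   # fraction of matching positions
print(round(similarity, 2))            # 0.82
```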

Another universal approach is the cosine of the angle between vectors, i.e. cosine similarity. This metric is by design a measure of "closeness", of the similarity of two vectors (another example of such a measure is the Euclidean distance between them). Also a simple, cool thing – the only requirement is that your data be representable as vectors. It is often used, for example, to compare the similarity of two strings. For your data this measure gives 0.8 (1 - 0.2). Look closer and you'll notice this measure is very similar to correlation, and for good reason.
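Cosine similarity for the two sequences, computed directly from the definition (dot product over the product of norms):

```python
import numpy as np

a = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0])
b = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

# dot(a, b) = 4, and both norms are sqrt(5), so the result is 4/5.
cos_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 3))  # 0.8
```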

You can also measure the "closeness" of the data using the least squares method (OLS). It is even simpler than a shovel – it is as simple as its shaft. And it is versatile too.
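One hedged reading of "least squares as a closeness measure" is simply the mean squared error between the two arrays (for 0/1 data this coincides with the XOR mismatch rate):

```python
import numpy as np

a = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0])
b = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

mse = np.mean((a - b) ** 2)  # 0 means identical; here mse = 2/11
print(round(1 - mse, 2))     # 0.82, same as the XOR estimate
```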

If your data is a probability distribution (this is important: the method does not apply to ordinary experimental data), then the Earth Mover's Distance (also known as the Wasserstein distance) is ideal for you.
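For two 1-D histograms over the same bins with unit spacing, the Earth Mover's Distance reduces to the L1 distance between their CDFs, so it fits in a few lines of NumPy. The distributions below are hypothetical, just to make the sketch runnable:

```python
import numpy as np

# Two hypothetical probability distributions over the same 5 bins.
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.3, 0.1, 0.1])

# For 1-D histograms with unit bin spacing, EMD equals the sum of
# absolute differences between the cumulative distributions.
emd = np.abs(np.cumsum(p) - np.cumsum(q)).sum()
print(round(emd, 2))  # 0.4
```

For the general case (different supports, weighted samples), `scipy.stats.wasserstein_distance` does the same job without hand-rolling the CDFs.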

As you can see, there are a LOT of methods for measuring "closeness", and I have only briefly touched on the ones I know; which one to choose depends solely on your data. Even for "complex" data such as sound, images, or text, it is better to start by trying the simple methods: they are well studied, they will give you some result instantly, and it is known on what data they work badly and on what data they work well. Also note that a neural network is a classifier, not a similarity measure. Diving into neural networks without understanding what they are and how to work with them, you risk getting stuck for a long time, even with the nice Tensorflow-style abstractions. Each tool has its own application. Remember: if you gaze long into the abyss, the abyss gazes also into you.