Which HTTP methods can a crawler not crawl?

Question:

A conceptual question (or maybe not):

Of the HTTP methods, which ones cannot be "crawled" – or interpreted – by a crawler?

  • POST
  • GET
  • PUT
  • PATCH
  • DELETE

Can someone with knowledge of the subject answer us?

Answer:

In theory, crawlers generally use only methods that are both safe and idempotent: OPTIONS, GET and HEAD.

From the book "Cloud Standards: Agreements That Hold Together Clouds": "Web crawlers, for example, use only safe methods to avoid disturbing data on the sites they crawl"

That makes perfect sense given the purpose of a crawler: it is meant to read content, not to change it.

A great reference on the subject is https://www.whitehatsec.com/blog/http-methods/

Idempotency and safety are important attributes of HTTP methods. An idempotent request can be called repeatedly with the same results as if it only had been executed once. If a user clicks a thumbnail of a cat picture and every click of the picture returns the same big cat picture, that HTTP request is idempotent. Non-idempotent requests can change each time they are called. So if a user clicks to post a comment, and each click produces a new comment, that is a non-idempotent request.

Safe requests are requests that don't alter a resource; non-safe requests have the ability to change a resource. For example, a user posting a comment is using a non-safe request, because the user is changing some resource on the web page; however, clicking the cat thumbnail is a safe request, because clicking the cat picture does not change the resource on the server.

Production-safe crawlers consider certain methods as always safe and idempotent, e.g. GET requests. Consequently, crawlers will send GET requests arbitrarily without worrying about the effect of repeated requests or that the request might change the resource. However, safe crawlers will recognize other methods, e.g. POST requests, as non-idempotent and unsafe. So, good web crawlers won't send POST requests.
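
As a rough illustration of that policy, here is a minimal Python sketch of the check a well-behaved crawler might apply before sending a request. The names CRAWLABLE_METHODS and may_crawl are hypothetical and not taken from any real crawler; the allowed set simply mirrors the methods named in the answer above.

```python
# Methods this sketch allows: GET and HEAD are "safe" per RFC 2616 section 9.1.1,
# and OPTIONS should have no side effects per section 9.1.2.
# The set and the function name are assumptions made for illustration.
CRAWLABLE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def may_crawl(method: str) -> bool:
    """Return True if this (hypothetical) crawler policy permits the method."""
    return method.strip().upper() in CRAWLABLE_METHODS


if __name__ == "__main__":
    # The five methods listed in the question.
    for method in ("POST", "GET", "PUT", "PATCH", "DELETE"):
        verdict = "crawl" if may_crawl(method) else "skip"
        print(f"{method}: {verdict}")
```

Of the five methods in the question, only GET passes this check. PUT and DELETE are idempotent but not safe, so a crawler following this policy would skip them as well.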

RFC on Safe and Idempotent Methods: http://w3.org/Protocols/rfc2616/rfc2616-sec9.html

9.1.1 Safe Methods

Implementors should be aware that the software represents the user in their interactions over the Internet, and should be careful to allow the user to be aware of any actions they might take which may have an unexpected significance to themselves or others.

In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.

Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

9.1.2 Idempotent Methods

Methods can also have the property of "idempotence" in that (aside from error or expiration issues) the side-effects of N > 0 identical requests is the same as for a single request. The methods GET, HEAD, PUT and DELETE share this property. Also, the methods OPTIONS and TRACE SHOULD NOT have side effects, and so are inherently idempotent.
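
To make the "N > 0 identical requests" definition concrete, here is a small Python sketch using an in-memory stand-in for a server. ToyServer, its methods and is_idempotent are all invented for this illustration; a real server may of course behave differently from what the method semantics promise.

```python
class ToyServer:
    """In-memory stand-in for a server (hypothetical, for illustration only)."""

    def __init__(self):
        self.resources = {}
        self.comments = []

    def put(self, key, value):
        # PUT replaces the resource; repeating it leaves the same state.
        self.resources[key] = value

    def delete(self, key):
        # DELETE removes the resource; deleting again changes nothing more.
        self.resources.pop(key, None)

    def post_comment(self, text):
        # POST appends; every repetition adds another comment.
        self.comments.append(text)

    def state(self):
        return (dict(self.resources), list(self.comments))


def is_idempotent(request, n=3):
    """Compare server state after 1 call with the state after n identical calls."""
    once = ToyServer()
    request(once)
    many = ToyServer()
    for _ in range(n):
        request(many)
    return once.state() == many.state()


if __name__ == "__main__":
    print(is_idempotent(lambda s: s.put("cat", "big-cat.jpg")))
    print(is_idempotent(lambda s: s.delete("cat")))
    print(is_idempotent(lambda s: s.post_comment("Nice cat!")))
```

In this toy model the PUT and DELETE calls leave the same state no matter how often they are repeated, while the POST-style comment does not.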

However, it is possible that a sequence of several requests is non-idempotent, even if all of the methods executed in that sequence are idempotent. (A sequence is idempotent if a single execution of the entire sequence always yields a result that is not changed by a reexecution of all, or part, of that sequence.) For example, a sequence is non-idempotent if its result depends on a value that is later modified in the same sequence.

A sequence that never has side effects is idempotent, by definition (provided that no concurrent operations are being executed on the same set of resources).
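
The point about sequences can be sketched in Python as well. In this hypothetical example a dictionary stands in for the server: the sequence reads a counter (a GET) and then writes back the value plus one (a PUT). Each request is idempotent on its own, but the sequence is not, because its result depends on a value it later modifies. The names run_sequence and counter_after are assumptions made for the illustration.

```python
def run_sequence(store: dict) -> None:
    """One pass of a two-request sequence against an in-memory 'server'."""
    current = store["counter"]        # GET: a read with no side effects
    store["counter"] = current + 1    # PUT: sets the resource to a specific value


def counter_after(runs: int) -> int:
    """Return the counter value after executing the whole sequence `runs` times."""
    store = {"counter": 0}
    for _ in range(runs):
        run_sequence(store)
    return store["counter"]


if __name__ == "__main__":
    print(counter_after(1))
    print(counter_after(2))
```

Running the sequence twice leaves a different value than running it once, even though repeating the same individual PUT with the same body would not.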
