I wrote a small library for data synchronization between programs. There is a message queue. There is a server that attaches to the queue, opens a port, and responds to requests like "give all messages", "give messages since such-and-such time", "give messages for the last n ms", and so on. There is a client that attaches to a similar queue and periodically polls servers at the specified addresses/ports, adding new messages to the queue: it connects, polls, and disconnects. I wrote several programs that communicate through such queues. Under Windows (any version from W10 onward) everything works; I checked it with Qt 5.5.1. Under Ubuntu 14.04 with Qt 5.5.1, almost everything works too. The problem is with the word "almost".
So, there are programs A, B, C. B polls multiple sources (A) and forms an output queue that is polled by multiple clients (C). Delivery from A to B or from A to C always works. But from B to C, if C is launched later than B, communication sometimes fails to establish: timeouts and packet corruption occur. Meanwhile, if B and C are on the same node, everything works. For debugging, I tried spinning up VirtualBox virtual machines on a laptop and running B and C in different VMs, or one in a VM and the other on the host: everything works, even with network delays greatly increased. Problems appear only between different physical computers. I tried rebuilding B with Qt 4.8.6; the problem appears less often, but still sometimes occurs. If B is run as the Windows build under Wine, everything also works. Firewalls are disabled. The ports (9200-9300) used by the server are not occupied by anything else; after starting the server programs (A, B), a port scanner shows those ports as open.
Now I'm building a test bench at work (several Linux nodes) in order to run B under the debugger. But in general the situation leaves me deeply bewildered, because:
- the same code works between A and B, and between A and C, but not between B and C;
- the same code works between different virtual machines, but not between different physical ones;
- the same code works in the Windows build, but not in the Linux build;
- the same code works if C is started before B, but not vice versa.
The code passes cppcheck. Has anyone stepped on a similar rake? Can you suggest at least one possible reason for such strange behavior?
Update: It's getting weirder and weirder. I assembled the test bench, and I cannot reproduce the problem. The same A, B, and C connect to each other in every possible combination. I have a feeling I'll have to bring a spare switch to the site.
Update2: Managed to reproduce the problem on the bench. I attach several C clients to one B; after a while some of the clients (C) stop connecting. What is surprising: once a client has stopped connecting, I try to ping the machine hosting the server (B) from that client's machine, and it does not respond. It starts responding to pings only after machine B, in turn, pings the client (C). I suspect that Ubuntu treats this burst of requests coming from the client as a DoS attack and blocks the client's host. It remains to find where this is configured in the system settings and turn it off.
Update3: The situation really does look like DoS protection. This version also explains why the connection between A and B works but between B and C does not: at most one middleware (B) attaches to a given data source (A), whereas multiple clients (C) attach to B. Apparently it is the calls to the same port from different nodes that get interpreted as a DoS attack. The first connected client remains operational, but subsequent ones are cut off so thoroughly that even pings from them do not get through.
Update4: Found a workaround. I set up pings from the machine running B to the hosts running C. After that, those hosts stopped being cut off. It is still not clear what causes such a reaction from the system; ufw is disabled.
Update5: Issue resolved. The server must not terminate the connection itself; it has to wait until the connection is terminated by the client (see the update below).
Two causes of the problem were found. The first: the test network has serious problems of its own; even ping loses about 30 percent of packets. The second and main cause was in my code: the server closed the connection incorrectly. After processing an incoming connection and calling socket->write(), it blocked on socket->waitForBytesWritten() (so the server thread was stalled until the send completed), and then called socket->close(). Apparently the OS did not like such a forced closing of the socket. When I removed waitForBytesWritten() and called disconnectFromHost() instead of close(), everything worked.
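The two variants can be sketched as follows. This is a minimal illustration, not the original code: the function names are hypothetical, and the socket is assumed to come from QTcpServer::nextPendingConnection().

```cpp
#include <QTcpSocket>

// Problematic pattern: block the server thread until the send
// completes, then force-close the socket immediately.
void replyAndCloseOld(QTcpSocket *socket, const QByteArray &reply)
{
    socket->write(reply);
    socket->waitForBytesWritten(); // stalls the server thread
    socket->close();               // abrupt close right after the send
}

// Gentler variant: initiate a graceful shutdown. Qt flushes any
// pending data and performs the TCP close handshake asynchronously.
void replyAndCloseNew(QTcpSocket *socket, const QByteArray &reply)
{
    socket->write(reply);
    socket->disconnectFromHost();  // waits for pending writes, then closes
}
```

Note that disconnectFromHost() returns immediately and lets the event loop finish the write, whereas the old pattern both blocked the thread and tore down the connection abruptly.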
Update: Not everything worked. The problem remained on the server side; the workaround from Update4 helps only as long as the configuration stays static. When a mobile client is added that connects from different places and different IPs, it can no longer be included in the ping script, and it drops off fairly soon. Even stationary clients sometimes report losing and restoring the connection. The real solution was found only recently.
So: when the server began closing the socket more carefully, via disconnectFromHost(), the problem became less acute. However, the correct solution turned out to be this: the server should not close the socket itself at all. Instead, it must react to the socket's disconnected and error signals and call deleteLater() from there.
When the server decides to close the socket itself after it has finished sending data, a signal race occurs: the disconnect notification from the server may reach the client earlier than the server's response, and the client then sees "server disconnected; server connected". And when the client's own disconnect request reaches the server and the socket no longer exists, an error condition apparently occurs, and an accumulation of such situations leads to the client's IP being blocked (under Linux; Windows, it seems, simply ignores such requests).
Bottom line: I simply removed disconnectFromHost() from the server, and the old client works (a week-long test of 2 clients against 1 server is finishing) without a single loss of communication. The polling timeout has been reduced from 400-600 ms to 10 ms, and there are still no gaps. The main conclusion: the server must not close the connection itself; it must react to the disconnected and error signals that arise from the client's actions and from the transport medium.
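The final scheme can be sketched roughly like this. This is an illustrative outline under my own assumptions, not the author's actual code: the class name SyncServer and the helper makeReply() are invented, and I use the old SIGNAL/SLOT syntax because the error(QAbstractSocket::SocketError) signal is overloaded in the Qt 4.8/5.5 versions mentioned above.

```cpp
#include <QTcpServer>
#include <QTcpSocket>

class SyncServer : public QTcpServer
{
    Q_OBJECT
protected:
    void incomingConnection(qintptr descriptor) override
    {
        QTcpSocket *socket = new QTcpSocket(this);
        socket->setSocketDescriptor(descriptor);

        // Let the client (or the transport) drive the shutdown:
        // the socket is deleted only in reaction to these signals.
        connect(socket, SIGNAL(disconnected()),
                socket, SLOT(deleteLater()));
        connect(socket, SIGNAL(error(QAbstractSocket::SocketError)),
                socket, SLOT(deleteLater()));

        connect(socket, SIGNAL(readyRead()),
                this, SLOT(onReadyRead()));
    }

private slots:
    void onReadyRead()
    {
        QTcpSocket *socket = qobject_cast<QTcpSocket *>(sender());
        if (!socket)
            return;
        QByteArray request = socket->readAll();
        socket->write(makeReply(request)); // hypothetical reply builder
        // Deliberately NO close() and NO disconnectFromHost() here:
        // the client closes the connection when it is done.
    }
};
```

The key point is that the server's side of the teardown is purely reactive, which removes the race between the server's reply and its disconnect notification.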