Sphinx can search for words that occur within the same sentence. For example, take this text:
Vasya is great, he ate a cucumber, because hungry. So it goes.
If you query
great SENTENCE cucumber
then this text is found. If you query
great SENTENCE hungry
then this text is not found. Apparently Sphinx splits text into sentences in a simple way and treats the first period it encounters as the end of the sentence, even one that does not actually end the sentence. Hence the question.
How can Sphinx be configured to split text into sentences more intelligently when building an index? Any approach will do: a setting in the config, or plugging in an external sentence splitter such as Yandex's Tomita parser.
One idea was to split the text into sentences in advance with the Tomita parser and tell Sphinx to treat a newline as the sentence separator, but judging by the Sphinx sources, this is unlikely to work.
The solution that worked:
Use the Tomita parser to split the text into sentences. The result is a text in which sentences are separated by line feeds.
In each resulting sentence, delete all periods, exclamation marks and question marks except the final ".", "?" or "!".
Build the Sphinx index from this processed data. Sentences are then split as intended, because Sphinx ends a sentence when it encounters ".", "?" or "!".
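The punctuation clean-up step can be sketched in Python roughly as follows. This is only an illustration, assuming the splitter's output is already available as a string with one sentence per line. The function names normalize_sentence and prepare_for_indexing are made up for this example, and appending "." to a sentence with no terminator is my own assumption, not something Sphinx or Tomita requires.

```python
import re

SENTENCE_MARKS = ".?!"


def normalize_sentence(sentence: str) -> str:
    """Keep only the final '.', '?' or '!' of one already-split sentence."""
    sentence = sentence.strip()
    # Separate the trailing run of terminators from the rest of the sentence.
    body = sentence.rstrip(SENTENCE_MARKS)
    trailing = sentence[len(body):]
    # Assumption: a sentence without any terminator gets a plain ".".
    terminator = trailing[-1] if trailing else "."
    # Delete every remaining '.', '?' and '!' inside the sentence.
    body = re.sub(r"[.?!]", "", body).rstrip()
    if not body:
        return ""  # nothing but punctuation or whitespace: drop the line
    return body + terminator


def prepare_for_indexing(split_text: str) -> str:
    """Normalize text that has one sentence per line (hypothetical helper)."""
    lines = (normalize_sentence(line) for line in split_text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "Vasya is great, i.e. he ate a cucumber.\nSo it goes."
    print(prepare_for_indexing(sample))
```

Note that deleting periods also glues abbreviations and decimal numbers together ("i.e." becomes "ie", "3.14" becomes "314"), so queries against such tokens would need the same normalization.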