Question:
Since there's an obsession with accented questions, here's mine 🙂
SQLite allows you to make direct or like
comparisons with any encoding/charset as long as it is done byte by byte . It only allows to make the comparison disregarding the case sensitivity (case insensitive) with ASCII characters.
I have some situations where I need to use SQLite with case insensitive and consequently accent insensitive also for latin characters (ISO 8859-1 or Latin1). I don't need and I don't use UTF-8 or UTF-16 encoding , because, and not only, there is a lack of implementation of the proper treatments of these encodings in SQLite.
The default answer from SQLite developers, which is english centric , is that if you need extra ASCII handling, you should provide it. They claim that you will already have this available in your application. Which is not true in my case.
I often use SQLite in languages that don't have the proper handling of case-normalizing characters taking into account Latin characters, at least not in a way that SQLite can call.
If I used UTF8 I would still not use ICU (en) because of its weight in memory and processing. It's a cannon to kill birds and in some cases it would bring me undesirable complications.
It would need to be in C to communicate directly with the SQLite API without overhead and to be portable to any language used in the application and also to the operating system, which is another requirement.
The main requirements:
- normalize case handling leading characters (disregarding accents) from ISO-8859-1
- lightness (memory and processing)
- simplicity (which allows for easy deployment and maintenance)
- portability (of language and operating system)
- avoid data duplication in tables (create normalized column)
- no external dependency or license that allows me to compile along with my application/SQLite.
It can be a simple library or a function to subscribe to the SQLite API. Preferably it also works with LIKE
and FTS
, UPPER
and LOWER
, etc.
I've searched and it doesn't solve my problem:
- https://stackoverflow.com/questions/973541/how-to-set-sqlite3-to-be-case-insensitive-when-string-comparing?lq=1 (only handles ASCII)
- https://stackoverflow.com/questions/15051018/localized-collate-on-a-sqlite-string-comparison (requires ICU)
- https://stackoverflow.com/questions/16806570/how-do-i-compare-characters-in-custom-sqlite-collation-in-objective-c (requires unavailable language and I don't even know if it solves every problem)
- https://stackoverflow.com/questions/2392732/sqlite-python-unicode-and-non-utf-data (depends on Python)
- https://stackoverflow.com/questions/9118273/using-only-utf8-encoding-in-sqlite-what-can-i-trim-out-of-the-icu-dataset (promising but doesn't solve or doesn't work as the expected)
Reference: SQLite API
Answer:
Sqlite is case-insensitive only for ASCII natively, because one of the goals of the database is precisely to be Lite 🙂 and there is the understanding that any application that depends on a specific language comparison will already have a function for that, since the database is just one of the points where it will be used.
To have an order in the way you want, using your own criteria, just implement a new collation, using the sqlite3_create_collation( )
(en) family of functions, which allows you to point your own function in C, which compares strings from the whichever way you find most convenient.
After having this collation implemented, just include it in the desired query:
SELECT nome, endereco FROM cadastro ORDER BY nome COLLATE CollationExemploPtBr
Details of the function to be implemented depend on the needs of each one, the important thing is to know that your function will receive the two strings to be compared, and should return 0 for strings considered equal (action and action, in Portuguese, should return 0), some positive for string1 greater than string2, and a negative for string1 less than string2.
One of the advantages of having your own collation is that you can not limit yourself to comparing characters just by accent, but also make & and E have the same "weight" when sorting a string, or whatever else is convenient. for the desired result.
As long as you call the create_collation function before you start using the database functions themselves, you can use your collation (or several different ones at the same time, as you wish) normally as index. This is especially important to maintain DB performance without losing the freedom to sort as you like. These functions work as efficiently or even better than the native ones in SQLite, depending only on the quality of the implemented code.
Remember that collation is only part of the process, because when it comes to locating a string with LIKE
, for example, you must also implement a compatible function, being able to use the create_function()
(en) family of functions that uses the same principle, point a function of your code that processes the result the way you want.
Here's a simplified example, adapted from a function I use in some applications:
// Atencao: esta implementacao esta muito simplificada,
// e foi postada como mero exemplo. Por ter sido rapidamente
// adaptada e simplificada de um codigo de uso interno para
// ser postada no SO, pode conter erros de tipagem e alguma
// condicao nao tratada (como utf mal formado) ou bugs.
//
// Esta implementacao somente esta considerando os caracteres
// acentuados entre u+0000 e u+00ff, para fins de exemplo, e
// considerando as equivalencias basicas em pt_BR
//
// -- Bacco
sqlite3 *db;
if (SQLITE_OK==sqlite3_open( "databasename.db", &db)
{
sqlite3_create_collation( db,
"CollationExemploPtBr",
SQLITE_UTF8, /* Nota [1] */
&example_table_ptbr, /* Nota [2] */
&example_collation );
// [1] A pergunta original menciona 8859-1. Notar que este codigo
// trata de UTF, mas basta ajustar os "ifs" pra ignorar UTF e
// usar a tabela com 256 caracteres "puros".
//
// [2] Notar que estou usando um cargo pra nao referenciar o
// example_table_ptbr direto no example_collation.
// O SQLite repassa esse pointer ao chamar a funcao indicada.
// Este e um otimo jeito de usar a mesma funcao com tabelas diferentes.
// Se sua funcao ja tiver a tabela embutida,
// basta usar null neste parametro.
// ... seu codigo aqui ...
}
static const char example_table_ptbr[] = {
/* u+0000 .. U+007F */
0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07, 0x08,0x09,0x0A,0x0B,0x0C,0x0D,0x0E,0x0F,
0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17, 0x18,0x19,0x1A,0x1B,0x1C,0x1D,0x1E,0x1F,
0x20,0x21,0x22,0x23,0x24,0x25,'E' ,0x27, 0x28,0x29,0x2A,0x2B,0x2C,0x2D,0x2E,0x2F,
0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37, 0x38,0x39,0x3A,0x3B,0x3C,0x3D,0x3E,0x3F,
0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47, 0x48,0x49,0x4A,0x4B,0x4C,0x4D,0x4E,0x4F,
0x50,0x51,0x52,0x53,0x54,0x55,0x56,0x57, 0x58,0x59,0x5A,0x5B,0x5C,0x5D,0x5E,0x5F,
0x60,'A' ,'B' ,'C' ,'D' ,'E' ,'F' ,'G' , 'H' ,'I' ,'J' ,'K' ,'L' ,'M' ,'N' ,'O' ,
'P' ,'Q' ,'R' ,'S' ,'T' ,'U' ,'V' ,'W' , 'X' ,'Y' ,'Z' ,0x7B,0x7C,0x7D,0x7E,0x7F,
/* u+0080 .. U+00FF */
0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20, 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,
0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20, 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,
0x20,'!' ,'C' ,'L' ,0x20,'Y' ,0x20,0x20, 0x20,'C' ,'A' ,'<' ,0x20,0x20,'R' ,0x20,
'O' ,0x20,'2' ,'3' ,0x20,'U' ,0x20,0x20, 0x20,'1' ,'O' ,'>' ,0x20,0x20,0x20,'?' ,
'A' ,'A' ,'A' ,'A' ,'A' ,'A' ,0x20,'C' , 'E' ,'E' ,'E' ,'E' ,'I' ,'I' ,'I' ,'I' ,
'D' ,'N' ,'O' ,'O' ,'O' ,'O' ,'O' ,'X' , 'O' ,'U' ,'U' ,'U' ,'U' ,'Y' ,0x20,'S' ,
'A' ,'A' ,'A' ,'A' ,'A' ,'A' ,0x20,'C' , 'E' ,'E' ,'E' ,'E' ,'I' ,'I' ,'I' ,'I' ,
'D' ,'N' ,'O' ,'O' ,'O' ,'O' ,'O' ,'/' , 'O' ,'U' ,'U' ,'U' ,'U' ,'Y' ,0x20,'Y' };
int example_collation(void* pArg,int n1,const void* s1,int n2,const void* s2)
{
int v = 0;
unsigned char b1;
unsigned char b2;
unsigned char *t = (unsigned char *)pArg;
unsigned char *p1 = (unsigned char *)s1;
unsigned char *p2 = (unsigned char *)s2;
unsigned char *c1 = p1 + n1;
unsigned char *c2 = p2 + n2;
while( ( p1 < c1 || p2 < c2 ) && v == 0) {
b1 = *(p1++);
b2 = *(p2++);
if ( p1 > c1 )
b1 = 0x20;
else if ( b1 < 0x80 )
b1 = t[ b1 ];
else if( ( b1 & 0xFE ) == 0xC2 )
b1 = t[ 0x80 | ( ( b1 << 6 ) & 0x40 ) | ( *(p1++) & 0x3F ) ];
if ( p2 > c2 )
b2 = 0x20;
else if ( b2 < 0x80 )
b2 = t[ b2 ];
else if( ( b2 & 0xFE ) == 0xC2 )
b2 = t[ 0x80 | ( ( b2 << 6 ) & 0x40 ) | ( *(p2++) & 0x3F ) ];
v = b1 - b2 ;
}
return v;
}