I'm working on two systems that have a common feature: entities whose attributes are dynamic (ie cannot be " hardcoded "), need to be consulted in search operations (result filter), but otherwise they are decoupled from the rest of the system (ie there is nothing referencing them, nor are they used in any specific calculation). I'm having trouble finding an appropriate way to represent them.
So far, I've been using the (anti-) EAV pattern. There is an
Entity table that identifies the entity, a
Tag table (
Entity ) that says the "type" – or "class" – of the entity, an
Attribute table (
Tag ) that identifies the property, its type, and if may or may not be absent, and finally several
XValue tables (
DateValue etc) that say "the entity E has the value V for the attribute A".
In context, an entity could be a product for sale, its tag the type of product (e.g. "car"), attributes being the characteristics of a car (e.g. "brand", "model", "kilometers driven" ) and the values for the application of those characteristics to the product for sale (eg "Fiat", "Palio", "20000"). Product types can be created, changed and deleted (dynamic), you can query a product type by its characteristics (eg filter cars by "Fiat" brand), but there is no foreign key for a specific attribute, and there will never be any "non-generic" calculations involving product characteristics.
And, as I said, I currently implement this in the form of EAV models – in a PostgreSQL database, cross-platform at first but more focused on Linux.
Usually the use of EAV is a "potential WTF alert", but I believe this would be one of its legitimate uses (data structures modeled this way are actually data , not metadata ). The vast majority of the bank is formalized, only a few entities were modeled this way (and I took steps to ensure that it would be possible to formalize any model that might be coupled to the rest of the system).
However, I am looking for alternatives to this model – as it is cumbersome to deal with and full of pitfalls . I thought of representing each entity by a simple XML, but there is the requirement of filtered search – which I don't know how to do [efficiently]. I have no experience with non-relational ( "NoSQL" ) DBMSs, so I don't know if their performance characteristics would be acceptable (I'm predicting a high number of reads – filtered searches – simultaneous as well as simultaneous writes – even if not to the same entity ) . Other ideas came to my mind, but they seem too "crazy" to use in practice (eg dynamically changing the DB schema whenever a "type" is changed).
I ask that the answers be based on previous experience involving systems with similar requirements, not just opinion and/or theoretical basis, if possible. Specific technology suggestions are welcome, but what I'm looking for is more of a solution strategy rather than a software recommendation .
Update: For those who are not familiar with the "Entity-Attribute-Value" (unfortunately I couldn't find any material in Portuguese on the subject), here is a concrete example:
In a traditional (formalized) model, a "Car" entity could be represented like this:
Carros ID Marca Modelo Km ---------------------------------- 1 Fiat Palio 20000 2 Honda Fit 10000
In an EAV modeling, this is done like this:
Entidades Atributos Valores ID Tipo ID Nome Entidade Atributo Valor ---------- ------------- ----------------------- 1 Carro 1 Marca 1 1 Fiat 2 Carro 2 Modelo 1 2 Palio 3 Km 1 3 20000 2 1 Honda 2 2 Fit 2 3 10000
The advantage of this representation is that the "Car" type is not hardcoded : if I want to add a "Color" attribute, I don't need to change the table structure, create new queries to deal with this attribute, etc – just create one more row in the
Atributos table and for each car one more row in the
Valores table. If besides cars I want to represent something else (eg real estate), just create rows in the
Entidades table with this new type, and give it attributes in the same way as "Cars".
The downsides are… well, every imaginable one ! Generally speaking, this is a pattern to be avoided at all costs (ie an "anti-pattern"). However, even the biggest critics admit that – in certain cases – it is unavoidable. My question is: 1) does this apply to my case? or is there a formalized way to meet my requirements? 2) if it is inevitable, if there are good alternatives that have acceptable performance.
This is a typical use case for NoSQL tools, especially tools in the document-oriented database family (like MongoDB ). Other possible/complementary alternatives are enterprise search tools (like Elasticsearch ).
The key word here is schema-less , a feature with advantages and disadvantages (see this presentation by Martin Fowler on the subject).
I have had good experience with large systems of:
- E-commerce ⇒ They usually have very complicated relational models to represent attributes / attribute groups / category hierarchy / category attributes. etc.
- Positions and Resumes ⇒ Usually deal with unstructured textual information. Algorithms for determining relevance are extremely important.
- Telco ⇒ Store diverse configurations that are difficult to structure in a relational model.
In fact, from a certain volume of data, as well as for certain read and write characteristics, enterprise search , NoSQL or both are pretty much necessary to keep the system responsive.
How it works
In both cases it is possible to create documents with different attributes (eg, in JSON format), including complex attributes ( arrays , nested types, etc). Documents with different attributes can be grouped into the same category (ie a document's type does not define its attributes).
When you search for a certain document (eg by id ), you get all the information that persisted about that document (ie, all attributes).
When you make a query using a certain attribute, the engines search only the documents that contain that attribute.
In practice, your application ends up knowing and defining the rules for the attributes (eg, a car will always have
ano , but only imported cars have the country of origin attribute) and you make "sane" queries on top of these rules. It's what Martin Fowlwer calls implicit schemas .
Performance: The correct use of these solutions is much more efficient than the EVA standard in relational databases. These solutions are designed to quickly scale and return query results on massive data sets .
Queries: Searches by types, ranges, logical searches, sorting, scoring , full text search , in short, all kinds of things can be done. Of course, tools have their own characteristics. It takes a while for you to understand how to structure information efficiently, as well as getting rid of the addiction to aggregate functions, having clauses and the like (although even this type of query can be done with MapReduce solutions).
ACID/BASE : As with relational databases, NoSQL tools can often be configured to support different consistency/availability/fault tolerance characteristics (see CAP Theorem ). Each tool also has its own transaction and serialization mechanisms*.
Practicality: Persist your documents and make all kinds of queries (as well as updates , deletes , partial updates , etc). Everything is very simple (the learning curve of Relational DBMSs, in my view, is much steeper).
* Specific to the mentioned tools: You will have no problem with concurrent reads or writes. I just recommend keeping at least two nodes up at all times (especially for MongoDB). The intention here is not (only) to guarantee high availability, but to get the most out of the tools. I recommend doing this even for the development environment… This task is much simpler than it looks (I did local configurations with VMs on my laptop, I didn't waste more than an hour configuring MongoDB and Elasticsearch).