I'm working on two systems that share a common characteristic: entities whose attributes are dynamic (i.e., they cannot be fixed or hardcoded), that need to be consulted in search operations (result filtering), but that are otherwise decoupled from the rest of the system (nothing references them, and they are not used in any specific calculation). I'm having trouble finding an appropriate way to represent them.
So far I have been using the (anti-)pattern EAV. There is an `Entity` table that identifies the entity, a `Tag` table (linked from `Entity`) that gives the "type" (or "class") of the entity, an `Attribute` table (linked from `Tag`) that identifies each property, its type, and whether it may be absent, and finally several `XValue` tables (`DateValue` etc.) that say "entity E has the value V for attribute A".
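To make the layout above concrete, here is a minimal sketch of such an EAV schema. This uses sqlite3 in place of PostgreSQL just so it runs standalone, and the table and column names are illustrative (based on my description), not my actual schema:

```python
import sqlite3

# In-memory stand-in for the PostgreSQL schema described above.
# Names are hypothetical; only the text-typed value table is shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL               -- the entity "type", e.g. 'car'
);
CREATE TABLE entity (
    id     INTEGER PRIMARY KEY,
    tag_id INTEGER NOT NULL REFERENCES tag(id)
);
CREATE TABLE attribute (
    id       INTEGER PRIMARY KEY,
    tag_id   INTEGER NOT NULL REFERENCES tag(id),
    name     TEXT NOT NULL,
    type     TEXT NOT NULL,          -- 'text', 'int', 'date', ...
    optional INTEGER NOT NULL        -- may the value be absent?
);
-- One value table per data type ("XValue"); text variant only here.
CREATE TABLE text_value (
    entity_id    INTEGER NOT NULL REFERENCES entity(id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(id),
    value        TEXT NOT NULL,
    PRIMARY KEY (entity_id, attribute_id)
);
""")
```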
In this context, an entity could be a product for sale; its tag, the type of product (e.g. "car"); the attributes, the characteristics of a car (e.g. "brand", "model", "kilometres driven"); and the values, the application of those characteristics to the product for sale (e.g. "Fiat", "Palio", "20000"). Product types can be created, changed and deleted (they are dynamic), and you can query products of a type by their characteristics (e.g. filter cars by the "Fiat" brand), but there is never a foreign key to a specific attribute, and there will never be any "non-generic" calculation involving product characteristics.
And as I said, I currently implement this in the form of EAV models – in a PostgreSQL database, cross-platform at first but more Linux-oriented.
Normally the use of EAV is a "potential WTF alert", but I believe this would be one of its legitimate uses (the structures modeled this way really are data, not metadata). The vast majority of the database is formalized; only a few entities were modeled this way (and I took steps to ensure that any model that might become coupled to the rest of the system could still be formalized).
However, I am looking for alternatives to this model, as it is cumbersome to deal with and full of pitfalls. I thought of representing each entity as a simple XML document, but there is the requirement of filtered search, which I don't know how to do [efficiently] over XML. I have no experience with non-relational DBMSs ("NoSQL"), so I don't know whether their performance characteristics would be acceptable (I'm predicting a high number of simultaneous reads, i.e. filtered searches, as well as simultaneous writes, even if not to the same entity). Other ideas came to mind, but they seem too "crazy" to use in practice (e.g. dynamically changing the DB schema whenever a "type" is modified).
I ask that responses be based on previous experience with systems that had similar requirements, not just on opinion and/or theory, if possible. Suggestions of specific technologies are welcome, but what I'm looking for is a solution strategy, not a software recommendation.
Update: for those not familiar with the "Entity-Attribute-Value" model (unfortunately I found no material in Portuguese on the subject), here is a concrete example:
In a traditional (formalized) modeling, a "Car" entity could be represented like this:
Carros
ID   Marca   Modelo   Km
--------------------------
1    Fiat    Palio    20000
2    Honda   Fit      10000
In an EAV model, the same data looks like this:
Entidades         Atributos        Valores
ID  Tipo          ID  Nome         Entidade  Atributo  Valor
----------        -------------    -------------------------
1   Carro         1   Marca        1         1         Fiat
2   Carro         2   Modelo       1         2         Palio
                  3   Km           1         3         20000
                                   2         1         Honda
                                   2         2         Fit
                                   2         3         10000
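To see why this representation gets cumbersome, here is the filtered search from the example ("cars whose Marca is Fiat, with their Modelo") against those three tables, sketched in Python with sqlite3: every attribute you filter by or return costs another self-join on `Valores`.

```python
import sqlite3

# Build the three EAV tables from the example above (sqlite3 as a
# stand-in for PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entidades (id INTEGER PRIMARY KEY, tipo TEXT);
CREATE TABLE atributos (id INTEGER PRIMARY KEY, nome TEXT);
CREATE TABLE valores   (entidade INTEGER, atributo INTEGER, valor TEXT);
""")
conn.executemany("INSERT INTO entidades VALUES (?, ?)",
                 [(1, "Carro"), (2, "Carro")])
conn.executemany("INSERT INTO atributos VALUES (?, ?)",
                 [(1, "Marca"), (2, "Modelo"), (3, "Km")])
conn.executemany("INSERT INTO valores VALUES (?, ?, ?)", [
    (1, 1, "Fiat"),  (1, 2, "Palio"), (1, 3, "20000"),
    (2, 1, "Honda"), (2, 2, "Fit"),   (2, 3, "10000"),
])

# One join per attribute involved: v1 filters on Marca, v2 fetches
# Modelo. A ten-attribute result set would need ten joins.
rows = conn.execute("""
    SELECT e.id, v2.valor AS modelo
    FROM entidades e
    JOIN valores v1 ON v1.entidade = e.id AND v1.atributo = 1  -- Marca
    JOIN valores v2 ON v2.entidade = e.id AND v2.atributo = 2  -- Modelo
    WHERE e.tipo = 'Carro' AND v1.valor = 'Fiat'
""").fetchall()
print(rows)  # [(1, 'Palio')]
```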
The advantage of this representation is that the "Car" type is not hardcoded: if I want to add a "Color" attribute, I don't need to touch the table structure, create new queries to handle this attribute, etc. I just create one more row in the `Atributos` table and, for each car, one more row in the `Valores` table. If I want to represent something other than cars (e.g. real estate), I just create rows in the `Entidades` table with this new type and give it attributes in the same way as "Cars".
The disadvantages are… well, every disadvantage imaginable! Generally speaking, this is a pattern to be avoided at all costs (i.e. an "anti-pattern"). However, even its greatest critics admit that, in certain cases, it is unavoidable. My questions are: 1) does this apply to my case, or is there a formalized way to meet my requirements? 2) if it is unavoidable, are there good alternatives with acceptable performance?
This is a typical use case for NoSQL tools, in particular for the document-oriented database family (such as MongoDB). Other possible/complementary alternatives are enterprise search tools (such as Elasticsearch).
The keyword here is schema-less, a feature with both advantages and disadvantages (see Martin Fowler's presentation on the subject).
I have had good experience with large systems of:
- E-commerce ⇒ Often has very complicated relational models to represent attributes / attribute groups / category hierarchies / category attributes, etc.
- Jobs and Resumes ⇒ Usually deal with poorly structured textual information. Algorithms for determining relevance are extremely important.
- Telco ⇒ Stores different configurations that are difficult to structure in a relational model.
In fact, beyond a certain data volume, and for certain read and write patterns, enterprise search, NoSQL or both become practically necessary to keep the system responsive.
How it works
In both cases it is possible to create documents with different attributes (e.g. in JSON format), including complex attributes (arrays, nested types, etc.). Documents with different attributes can be grouped in the same category (i.e. a document's type does not define its attributes).
When you fetch a particular document (for example, by id), you get back all the information persisted for that document (i.e. all its attributes).
When you query by a certain attribute, the engines search only the documents that contain that attribute.
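A toy illustration of that behavior, with plain Python dicts standing in for documents (no real MongoDB here; in Mongo the equivalent would be a `find()` with a query filter):

```python
# Documents in the same "collection" need not share attributes.
carros = [
    {"_id": 1, "marca": "Fiat",  "modelo": "Palio", "km": 20000},
    {"_id": 2, "marca": "Honda", "modelo": "Fit",   "km": 10000,
     "pais_origem": "Japão"},   # only imported cars carry this field
]

def find(docs, **criteria):
    """Return only the documents that contain ALL the given
    attributes with the given values; documents lacking an
    attribute are simply skipped, as described above."""
    return [d for d in docs
            if all(k in d and d[k] == v for k, v in criteria.items())]

print(find(carros, marca="Fiat"))         # matches document 1
print(find(carros, pais_origem="Japão"))  # matches document 2 only
```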
In practice, your application ends up knowing and defining the rules for the attributes (e.g. a car will always have `ano` ["year"], but only imported cars have a country-of-origin attribute), and you make "sane" queries on top of those rules. This is what Martin Fowler calls implicit schemas.
Performance: used correctly, these solutions are much more efficient than the EAV pattern on a relational database. They are designed to scale quickly and to return query results over giant data sets.
Queries: searches by type, range queries, boolean queries, sorting, scoring, full-text search; in short, all sorts of things can be done. Of course, each tool has its own characteristics. It takes a while to figure out how to structure the information efficiently, and to break the habit of aggregate functions, HAVING clauses and the like (although even that kind of query can be done with MapReduce solutions).
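As a sketch of what replacing a GROUP BY looks like in application code (the same shape a MapReduce job takes: map emits a key per document, reduce folds the values per key), here is a hand-rolled aggregation over documents; the data and field names are made up for the example:

```python
from collections import defaultdict

docs = [
    {"tipo": "Carro", "marca": "Fiat",  "km": 20000},
    {"tipo": "Carro", "marca": "Fiat",  "km": 35000},
    {"tipo": "Carro", "marca": "Honda", "km": 10000},
]

# Equivalent SQL: SELECT marca, AVG(km) FROM carros GROUP BY marca
groups = defaultdict(list)
for d in docs:
    groups[d["marca"]].append(d["km"])   # "map": key on marca

avg_km = {marca: sum(kms) / len(kms)     # "reduce": fold per key
          for marca, kms in groups.items()}
print(avg_km)  # {'Fiat': 27500.0, 'Honda': 10000.0}
```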
ACID / BASE: as with relational databases, NoSQL tools can often be configured for different consistency / availability / fault-tolerance trade-offs (see the CAP theorem). Each tool also has its own transaction and serialization mechanisms*.
Practicality: persist your documents and run all kinds of queries (as well as updates, deletes, partial updates, etc.). Everything is very simple (the learning curve of a relational DBMS is, in my opinion, much steeper).
* Specific to the tools mentioned: you won't have problems with concurrent reads or writes. I just recommend keeping at least two nodes running at all times (especially for MongoDB). The point here is not (only) to guarantee high availability, but to get the most out of the tools. I recommend doing this even in the development environment… it is much simpler than it looks (I set up local configurations with VMs on my laptop and didn't spend more than an hour configuring MongoDB and Elasticsearch).