
I don't have a problem with relational databases. Or rather, I don't have a problem with relational data modelling.

I do have a problem with Oracle. Even the Oracle experts at my former job could barely get Oracle to do something sensible, and running it on my own computer was basically a death sentence for getting anything done.

I have a problem with MySQL, from a sysadmin perspective. When I had it installed, MySQL was the package that would always break on every update. No upgrade was small enough that the data files would continue working.

(I don't have experience with Postgres, but SQLite seems more comfortable than any of the mentioned alternatives)

I have a problem with schemas in my database. They require upfront data-modelling work, and I'd rather iterate. Also, nobody I've worked with seems to set up the schemas automatically; you have to run some special "initdb" script that nobody maintains just to get things working.

I have a problem with SQL. It would be awesome if we had a standard query language, but we don't. You can't apply the same SQL to different database engines mostly because it won't even compile, secondarily because it will give different results and finally because it will have a completely different performance profile.

All of this can be fixed by learning stuff, so I know better what I am doing.

But I already know CouchDB [1]. It took me little effort to learn, and it makes a lot of sense to my mind. I can solve problems with it, and it has neat properties with regards to master-master replication. So for me, CouchDB works just fine, just like a relational database works just fine for you :)

So, from my perspective, it seems that using some SQL solution would be the time consuming option.

[1]: CouchDB can't be considered a "Big data" database for many cases. It is slow. But it scales neatly :)



This is all pretty sane, except the schema-less part. I just don't understand why people get so hung up over schemas. Sure, migrations are a minor inconvenience, but if you just add fields in an ad-hoc fashion over time, the data becomes messy and it's hard to determine any invariants about a large dataset. Sure, this can be avoided through careful code organization, but then aren't you just re-inventing schemas out-of-band?


Yeah, I have experimented with schemaless stores a bit (mostly with JSON fields in Postgres, for now), to avoid adding columns for various things to the database, and I am not impressed. Not only are initialization and migration hard, but I've come to realize that there's no downside to adding these to the schema itself.

Sure, the schema gets a bit bloated and dirty, but having undocumented fields in a dict whose existence is signified only by a line of code assigning something is not better.

Where schemaless stores are great is prototyping. Especially for more algorithmic code, where I don't really know what storage I'll be needing and the algorithm will only live there for a few days, schemas are just a burden. That's why I wrote Goatfish: https://github.com/stochastic-technologies/goatfish
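For what it's worth, the general idea behind this kind of prototyping store can be sketched in a few lines (this is not Goatfish's actual implementation; the table layout and the `save`/`find` helpers here are made up for illustration): serialize each object to JSON in a single column and filter on deserialized fields in application code.

```python
import json
import sqlite3

# Minimal sketch of schemaless storage on SQLite: each row is one JSON
# document; queries deserialize and filter in Python. Fine for a
# prototype that lives a few days, too slow for real datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

def save(obj):
    conn.execute("INSERT INTO docs (body) VALUES (?)", (json.dumps(obj),))

def find(**criteria):
    # Full table scan -- the price of having no schema to index.
    for (body,) in conn.execute("SELECT body FROM docs"):
        doc = json.loads(body)
        if all(doc.get(k) == v for k, v in criteria.items()):
            yield doc

save({"name": "ada", "lang": "python"})
save({"name": "bob"})
print(list(find(lang="python")))  # [{'name': 'ada', 'lang': 'python'}]
```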


I think this is due to how you work and what you build. If you plan it out, focus on the data structures you need and then build it, schemas are fine because you know up-front what you want.

In more ad-hoc, constantly changing, "oops, this didn't work because of unknown factor X" type of projects, schemas are a pain. It sounds really nice to have a data store that adapts when it's impossible to know up-front what you need from your data structures.


In more ad-hoc, constantly changing, "oops, this didn't work because of unknown factor X" type of projects, schemas are a pain.

Such a pain:

    ALTER TABLE foo DROP COLUMN bar;
    ALTER TABLE foo ADD COLUMN baz varchar(64);
From what I've seen, a significant chunk of the desire to use schemaless NoSQL hipster DBs is simply a desire to avoid learning SQL as if it were a real programming language.

The only real use case I've ever seen for schemaless DBs is datasets with lots of ad-hoc columns added by multiple people (typically logging/metrics data).


"lots of ad-hoc columns added by multiple people"

Inevitably completely un-normalized and junk data. Even worse with no documentation or procedure anything ever added becomes permanent legacy that can never be removed. Been there, lived it, hated it, won't allow it to happen again.


So you won't allow logfiles to happen again?


LOL, no, not as a primary store of data, no never again.

Plain text logs are a great place to funnel all "unable to connect to database" type of errors for dbas / sysadmins to ponder, however.

I've implemented quite a few systems where data changes are logged into a log table, all fully normalized, so various reports can be generated about discrepancies, access rates, and the like. This is also handy for end users, who can see who made the last change, etc.

Trying to reverse-engineer how some data got into a messed-up state is pretty easy when the logs can be JOINed to the actual data tables as necessary, compared to writing a webserver log-file parser that reads multiple files to figure out who did what, when, to the data before it got screwed up. You only parse log files for that once before deciding to do it relationally. Debug time drops by a factor of 100x.


I'd like to pick your brain, as that's the problem I'm facing right now. I have a web site that is accessed by users, and I would like to get a comprehensive picture of what they do. I already have a log table for recording all changes (as you said, I can show it to the users themselves so they know who changed what in a collaborative environment), but I'm struggling to define a meaningful way to log read access. Should I record every hit? Would that be too much data to store and process later? Should I record some aggregates instead?


Rather than aggregates, sampling.

Sounds like you're not interested in absolutely every event for security or data-integrity audits, and more interested in general workflow. Be careful when selecting samples to store, because what superficially looks random might not be (a DB primary key, say, or a timestamp ending in :01). Something like: read one byte from /dev/urandom, and if it equals 0x42, log this time.

Also, it depends on whether you're unable to log everything because of sheer volume, or lack of usable purpose, or something else. For example, if you've got the bandwidth to log everything but not to analyze it in anything near realtime, maybe log everything for precisely one random minute per hour. So this hour everything that happens at minute 42 gets logged, everything at minute 02 next hour, whatever. Better mod-60 your random number.

You can also play games with hashes. IF you have something unique per transaction and you have a really speedy hash, then a great random sampler is to hash that unique stuff and only log if the hash ends in 0x1234 or whatever.
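The hash trick above can be sketched like this (the transaction-ID format and the magic byte 0x42 are illustrative; any stable per-transaction key works):

```python
import hashlib

# Hash-based sampling: hash a per-transaction unique key and log only
# when the hash falls in a fixed slice. Matching one full byte samples
# roughly 1/256 of transactions, and the same transaction always gets
# the same verdict (deterministic, unlike reading /dev/urandom).
def should_log(transaction_id: str, match: int = 0x42) -> bool:
    digest = hashlib.blake2b(transaction_id.encode()).digest()
    return digest[-1] == match

# Over many transactions the sample rate converges on ~1/256.
sampled = sum(should_log(f"txn-{i}") for i in range(100_000))
print(sampled)  # roughly 100_000 / 256, i.e. around 390
```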

If you have multiple frontends, and you REALLY trust your load balancer, then just log everything on only one host.

I've found that storing data is usually faster and easier than processing it. Your mileage may vary.

Another thing I've run into: it's really easy to fill a reporting table with indexes, making reads really fast while killing write performance. So make two tables, one index-less that accepts all raw data and one indexed to generate a specific set of reports, then periodically copy some sample out of the index-free log table into the heavily indexed report table.
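A sketch of that two-table split, using SQLite for brevity (table and column names are made up; a real roll-up job might sample rather than copy every row past a watermark):

```python
import sqlite3

# Index-free table for cheap writes; separately indexed table for reads.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_log (ts INTEGER, user_id INTEGER, action TEXT);
    CREATE TABLE report_log (ts INTEGER, user_id INTEGER, action TEXT);
    CREATE INDEX idx_report_user ON report_log (user_id);
    CREATE INDEX idx_report_ts ON report_log (ts);
""")

def ingest(ts, user_id, action):
    # No indexes to maintain here, so writes stay fast.
    conn.execute("INSERT INTO raw_log VALUES (?, ?, ?)", (ts, user_id, action))

def roll_up(since_ts):
    # The periodic copy into the report table.
    conn.execute(
        "INSERT INTO report_log SELECT * FROM raw_log WHERE ts > ?",
        (since_ts,),
    )

for i in range(10):
    ingest(i, i % 3, "view")
roll_up(4)
print(conn.execute("SELECT COUNT(*) FROM report_log").fetchone()[0])  # 5
```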

It's kinda like the mentality of doing backups: look at how sysadmins spend time optimizing full backup tape dumps, and sometimes just dump a delta from the last backup.

You're going to have to cooperate with operations to see what level of logging overloads the frontends. There's almost no way to tell other than trying it unless you've got an extensive testing system.

I've also seen logging "sharded" off onto other machines. Let's say you have 10 front ends connecting to 5 back ends, so FE#3 and FE#4 read from BE#2 or whatever. I would not have FE3 and FE4 log/write to the same place they're reading from (BE2). Have them write to BE3 or something, anything but the one they're reading from. Maybe even a dedicated logging box, so writing logs can never, ever interfere with reading.

Another strategy I've seen, which annoys the businessmen: assuming you're peak-load limited, shut off logging at peak hour. Or write a little thermostat cron job, where if some measure of system load or latency exceeds X%, then logging shuts off until 60 minutes in the future or something. Presumably you have a test suite/load tester that figured out you can survive X latency or X system load or X kernel-level IO operations per minute, so if you exceed it, all your FEs flip to non-logging mode until it drops beneath the threshold.

This is a better business plan because, instead of explaining to "the man" that you don't feel like logging at their prime time, you can provide proven numbers showing that if they're willing to spend $X, they could get enough drive bandwidth (or whatever) that it would never go into log-limiting mode.

Try not to build an oscillator; dampen it a bit. For example: the more you exceeded the threshold in the past, the longer logging stays silenced in the future, so if it does oscillate, at least it won't flap too fast.

One interesting strategy for sampling to see if it'll blow up is sample precisely one hour. Or one frontend machine. And just see what happens before slowly expanding.

It's worth noting that unless you're already running at the limit of modern technology, something that would have killed a 2003 server is probably not a serious issue for a 2013 server. What was once (even recently) cutting edge can now be pretty mundane. What killed a 5400 RPM drive might not make a new SSD blink.

(Whoops, edited to add: I forgot to mention that you need to confer with the decision makers about how many sig figs they need, and talk to a scientist/engineer/statistician about how much data to collect to generate those sig figs. If the decision makers actually need 3 sig figs, then storing 9 sig figs' worth of data is a staggering financial waste, but claiming 3 sig figs when you really only stored 2 is almost worse. I ran into this problem one expensive time.)


It sounds to me like you're trying to reinvent web analytics. Is there a reason you need user level granularity or is aggregate data enough?


My web site has separate "accounts" for multiple companies, each with multiple users. I'd like three levels of analytics: for a given company (both I and the company agent would like to see this), across all companies for all users (I will see this), and rolled up within each company (i.e. a higher level of activity where companies, not individual users, are being tracked).

User-level data might be useful for tech support (although this is currently working fine with text-based log files and a grep).

So I guess I am not sure... I might be content with web analytics... Each company has its own URL on the site, like so: http://blah.com/company/1001/ViewData, http://blah.com/company/1002/ViewData, etc. Using e.g. Google Analytics I could see data for one company easily, but can I see data across all companies (how many users look at ViewData regardless of company id)? Can I allow the owner of company 1001 to see only their slice of the analytics?

Another monkey wrench is the native iPad app - ideally the analytics would track users across both native apps and the web site.


> using logs that can be JOINed to actual data tables as necessary is pretty easy

Can you give me a concrete example of how you would use this?


If by concrete you mean business case example, it comes up a lot in "the last three people to modify this record were..."

Security audit trail type stuff "Why is this salesguy apparently manually by hand downloading the entire master customer list alphabetically?".

This isn't all doom and gloom stuff either... "You're spending three weeks collecting the city/state of every customer so a marketing intern can plot an artsy graph by hand of 10000 customers using photoshop? OMG no, watch what I can do with google maps/earth and about 10 lines of perl in about 15 minutes" Or at least I can run a sql select that saves them about 50000 mouse clicks in about 2 minutes of work. Most "suit" types don't get the concept of a database and see it as a big expensive Excel where any query more complicated than select * is best done by a peon by hand. I've caught people manually alphabetizing database data in Word for example.

Another thing that comes up a lot in automation is treating a device differently WRT monitoring and alerting tools if a change was logged within the last 3 days. So your email alert for a monitored device contains a line something like "the last config change was made X hours ago by ...". Most of the time when X=0 or X=1, the alert is because ... screwed up, and when it isn't, a short phone call to ... is problem-isolation step #1.

These were all normal daily-operations business use cases, aside from the usual theoretical data-mining stuff, like noticing that users in A/B marketing test area "B" tend to update data table Q ten times more often than those in area "A", or whatever correlation seems reasonable (or not).
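The "last three people to modify this record" case from upthread reduces to a simple query once changes land in a log table. A sketch, using SQLite and an illustrative schema (a record_id column is assumed so log rows can be joined back to the row they changed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE logs (
        tableName  TEXT,
        record_id  INTEGER,
        oldValue   TEXT,
        newValue   TEXT,
        userID     INTEGER,
        changed_at TEXT
    )
""")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?, ?, ?)", [
    ("customers", 7, "a", "b", 1, "2013-05-01"),
    ("customers", 7, "b", "c", 2, "2013-05-02"),
    ("customers", 7, "c", "d", 3, "2013-05-03"),
    ("customers", 7, "d", "e", 2, "2013-05-04"),
])

# Who touched customers row 7 most recently?
last_three = conn.execute("""
    SELECT userID, changed_at FROM logs
    WHERE tableName = 'customers' AND record_id = 7
    ORDER BY changed_at DESC LIMIT 3
""").fetchall()
print(last_three)  # [(2, '2013-05-04'), (3, '2013-05-03'), (2, '2013-05-02')]
```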


That sounds reasonable. I've been thinking about doing logging as a text blob on the affected object, but it hasn't seemed useful enough.

Using your approach, I guess it would be a table like

    create table logs (
        tableName varchar(64),
        oldValue text,
        newValue text,
        userID int,
        changedAt datetime  -- "when" is a reserved word, so renamed
    );
Or did I miss something?


I've worked on systems where changing a single column meant that we'd have to take four hours of downtime while MySQL slowly...did whatever it does.


Is this the old bug (or whatever) from many years ago where it was faster to drop the indexes, make your schema change, then add the indexes back? Not sure whether that still applies. The scenario was something like: if you change the length of a column such that the index needs to recalculate it, like truncating a CHAR(15) to a CHAR(10), or doing something weird with full-text indexes, it would loop through each row, recalc the row for the index, completely sort and insert the row into the index, and then repeat for the next row. So it scaled as if you were inserting each row one line at a time (which can be pretty slow with lots of rows and extensive indexes), but there's a sneakier way to do it.

Or, if by "change a column" you mean something like an "UPDATE blah SET x=x+1;" where the x column was part of an index, that used to really work the indexing system hard, one row at a time. I think that issue was optimized out a long time ago. I believe there was a sneaky way to optimize around it, other than the index drop-and-create trick, by doing all 10 million increments as part of a transaction, so that it would do all the increments, close out the transaction, then recalculate the index. And there was something sneakier still: you couldn't do a transaction on one update, so you updated all the "prikey is even" rows and then all the "prikey is odd" rows, or something like that, as a two-part transaction. I didn't exactly do this last week, so I may misremember a detail...
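The drop-index trick is simple to show. A sketch on SQLite (the anecdote above is about MySQL, where whether this still helps is version-dependent; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blah (id INTEGER PRIMARY KEY, x INTEGER)")
conn.executemany("INSERT INTO blah (x) VALUES (?)",
                 [(i,) for i in range(1000)])
conn.execute("CREATE INDEX idx_x ON blah (x)")

# Instead of letting the bulk UPDATE maintain the index row by row:
conn.execute("DROP INDEX idx_x")                # stop per-row maintenance
conn.execute("UPDATE blah SET x = x + 1")       # the bulk change
conn.execute("CREATE INDEX idx_x ON blah (x)")  # one rebuild at the end

print(conn.execute("SELECT MIN(x), MAX(x) FROM blah").fetchone())  # (1, 1000)
```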


> It sounds really nice to have a data store that adapts when it's impossible to know up-front what you need from your data structures.

This seems backwards to me. Relational databases are much better at ad-hoc querying of data, whereas NoSQL scales for narrower access patterns. The fact that you can dump arbitrary columns without a migration is a nice convenience, but in general it will be less queryable than it would be if you added the column in a SQL database.


So that you can store/retrieve data without knowing the schema.

Oh, but you have to know the schema, right? Yes, some other part of the application knows the schema, but this part doesn't have authority over the DB. Also, the schema may itself be data.

NoSQL reduces the work needed for that.


Perhaps, but can't you just organize your objects by having a table for each type and adding columns as needed? It doesn't sound like such a big deal.


Looks like fun

But why would I waste developer time doing that when I can just do db.table.insert(obj) -- in MongoDB, for example (obj is a JS object)?

Also, finding all objects with a field named 'field1' and value '3' is slower if you do that in a relational DB (and that's the simplest case)


Why on earth would that be slower in a relational db? That makes no sense, since indices in both nosql and relational dbs are variants of b-trees.


Well, it's slower because of the join that exists in SQL (the relationship between your 'field/value' table and the entry). Apart from that, as you said, the indexes are similar.
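The join in question is the classic entity-attribute-value (EAV) layout: a fields table joined back to an entries table. Whether it's actually slower than a document store depends on indexes and data, but this is the shape of the query (schema and names are illustrative, shown here on SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fields (entry_id INTEGER, field TEXT, value TEXT);
""")
conn.execute("INSERT INTO entries VALUES (1, 'a'), (2, 'b')")
conn.executemany("INSERT INTO fields VALUES (?, ?, ?)",
                 [(1, "field1", "3"), (2, "field1", "4")])

# "Find all objects with a field named 'field1' and value '3'" costs a
# join here, where a document store reads one record.
matches = conn.execute("""
    SELECT e.name FROM entries e
    JOIN fields f ON f.entry_id = e.id
    WHERE f.field = 'field1' AND f.value = '3'
""").fetchall()
print(matches)  # [('a',)]
```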

The fun thing about NoSQL skeptics is how they think of only the current scenarios they work with, and they won't believe you until they get burned by it. So be it.


Well, because presumably SQL databases have features you don't get with noqsl solutions.


> but if you just add fields in an ad-hoc fashion over time the data becomes messy and it's hard to determine any invariants about a large dataset

In systems large enough to be running multiple versions of an app at the same time, talking to the same database, you have to do exactly this -- but you have to shoehorn it into relations (doing things like having a column where 99.999% of entries are null, and values gradually get computed as rows are accessed by the new version of the app). NoSQL lets you just say what you mean: that in some versions of the app the schema is X, in some versions it's Y, and the database isn't the arbiter of the schema.
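The lazy-backfill pattern described here looks roughly like this in the relational version (column name, helper, and the derived value are all illustrative): the new column starts NULL everywhere, and the new app version computes and persists it on first access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_domain TEXT)"
)
# An "old" row, written before email_domain existed: the column is NULL.
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")

def get_email_domain(user_id):
    email, domain = conn.execute(
        "SELECT email, email_domain FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if domain is None:
        # Old row: compute on first access and persist for next time.
        domain = email.split("@", 1)[1]
        conn.execute("UPDATE users SET email_domain = ? WHERE id = ?",
                     (domain, user_id))
    return domain

print(get_email_domain(1))  # example.com
```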


No. NoSQL lets you say nothing about the schema, and so it becomes a problem for the application layer, above the DB, to handle. In fact, this is much what happens with most "solutions" NoSQL presents to SQL database problems: Let's not implement it, then it's not a problem.

What happens when you push problems up the stack? Do they get solved automatically? No? Will they get solved? Perhaps, if really needed. And, for the cherry on top: Will the solutions be similar to the ones SQL databases use? They will.

You see, when you are implementing atomic transactions, for example, you may get ahead if you have some information about the problem domain. However, for most cases, you are solving the same problem SQL databases solved decades ago. And you'll find the same solution. Just not as well implemented nor as well tested.


"Let's not implement it, then it's not a problem."

It's interesting that, culturally, when MySQL tried this strategy with transactions more than a decade ago (namely, not having them until roughly the turn of the century), it was reviled mostly by people who neither knew what transactions were nor needed them, but who were nonetheless very unhappy that MySQL didn't check that box.

Now it's culturally seen as a huge win to simply not implement something difficult.

I don't know if it's a decline of feature-list length as a fetish, simple copycatting of others' behavior (perhaps in both situations), some pull-yourself-up-by-your-bootstraps romantic outlook on reimplementation, or the inevitable result of Homer Simpson meets the database, but whatever it is, it's an interesting major cultural change.


You wouldn't say that if you worked with mysql daily. I do, and every single day I long for the times when my stack used pgsql. Mysql is such an unfixable clusterfuck, with minimal speed advantages over real RDBMSs, that if you use it as the poster child of nosql's path you are effectively arguing against yourself.


I see it more as a move from "one size fits all" to more specialized tools. The term "nosql" is pretty useless, as it's way too general. Both your comments about "not implementing the hard stuff" and GP's about schemas only apply to some of the "nosql" projects.

Instead of looking at it as "aaawwm! NoSQL is attacking our beloved RDBMS", try looking at the different projects and what they bring to the table. Maybe some of them can be a useful addition to your systems.


Have to agree with the above. The problem with relational databases isn't the relational model, per se, but the complexity and cost of maintaining a relational database.

Typically, they require specialized database administrators whose primary job is to tune the database and keep it running.

Many businesses, even of moderate size, reach a point where they need to purchase expensive hardware (million-dollar RamSans and expensive servers) to optimize the performance of their database, because partitioning databases is challenging.

So the overhead of running an Oracle or Sql Server database is quite high.

There is huge room for improvement with these traditional database products. If someone made a good cloud database that supported the same feature set but with lower administration and maintenance costs then that might be a better option.


Now that is a reasonable concern. Keeping DB hardware happy is certainly an expensive undertaking. I think I'm more used to arguments like the one from Sauce Labs, where the VP whines, "What are schemas even for? They just make things hard to change for no reason. Sometimes you do need to enforce constraints on your data, but schemas go way too far," [1] and then goes on to say that his company is moving from CouchDB to using MySQL as a key-value store with serialized JSON (data integrity and performance be damned). I mean, really, the thought of converting millions of values in a table to objects just to run a home-grown MapReduce function on them, when you could just LEARN HOW TO USE MySQL, is pretty much the most insane thing I've ever heard lol.

Do you have any experience with Amazon RDS? I haven't tried it; I guess my concern would be the same as any other AWS product--they tend to fail catastrophically from time to time. Then again, if you're doing cloud NoSQL through Amazon, you're going to run into the same issues (see: Reddit).

[1](http://sauceio.com/index.php/2012/05/goodbye-couchdb/)


The problem of complexity has less to do with being relational and more to do with the data itself being large and complex. Relational or not doesn't change that much. If anything, by not keeping it relational, you are much more likely to end up with a disorganized database that isn't normalized.

This whole "specialized database administrator" point just seems moot considering the equivalent for that are the so-called Big Data developers.



