Showing posts with label Mysql. Show all posts
Showing posts with label Mysql. Show all posts

Thursday, November 19, 2009

NoSQL Ecosystem

I found a very interesting article entitled "NoSQL Ecosystem" from The Rackspace Could. It dove into the fact that we need an alternative for relational databases to handle the high volume of data these days. Some examples the article gave:

The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg’s (3 TB for green badges) or Facebook’s (50 TB for inbox search) or eBay’s (2 PB overall), per-server performance, and rigid schema design.

It's a good read, I would like to hear your thoughts around this NoSQL Movement idea.

By Jonathan Ellis, Systems Architect

Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as “NoSQL databases.”

The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg’s (3 TB for green badges) or Facebook’s (50 TB for inbox search) or eBay’s (2 PB overall),per-server performance, and rigid schema design.

Businesses, including The Rackspace Cloud, need to find new ways to store and scale large amounts of data. I recently wrote a post on Cassandra, a non-relational database we have committed resources to. There are other non-relational databases being worked on and collectively, we call this the “NoSQL movement.”

The “NoSQL” term was actually coined by a fellow Racker, Eric Evans when Johan Oskarsson of Last.fm wanted to organize an event to discuss open source distributed databases. The name and concept both caught on.

Some people object to the NoSQL term because it sounds like we’re defining ourselves based on what we aren’t doing rather than what we are. That’s true, to a degree, but the term is still valuable because when a relational database is the only tool you know, every problem looks like a thumb. NoSQL is making people aware that there are other options out there. But we’re not anti-relational-database for when that really is the best tool for the job; it’s “Not Only SQL,” rather than “No SQL at all.”

One real concern with the NoSQL name is that it’s such a big tent that there is room for very different designs. If this is not made clear when discussing the various products, it results in confusion. So I’d like to suggest three axes along which to think about the many database options: scalability, data and query model, and persistence design.

I have chosen 10 NoSQL databases as examples. This is not an exhaustive list, but the concepts discussed are crucial for evaluating others as well.

Scalability

Scaling reads is easy with replication, so when we’re talking about scaling in this context, we mean scaling writes by automatically partitioning data across multiple machines. We call systems that do this “distributed databases.” These include Cassandra, HBase, Riak,Scalaris, Voldemort, and more. If your write volume or data size is more than one machine can handle then these are your only options if you don’t want to manage partitioning manually. (You don’t.)

There are two things to look for in a distributed database: 1) support for multiple datacenters and 2) the ability to add new machines to a live cluster transparently to your applications.

Non-distributed NoSQL databases include CouchDB, MongoDB, Neo4j, Redis, and Tokyo Cabinet. These can serve as persistence layers for distributed systems; MongoDB provides limited support for sharding, as does a separate Lounge project for CouchDB, and Tokyo Cabinet can be used as a Voldemort storage engine.

Data and Query Model

There is a lot of variety in the data models and query APIs in NoSQL databases.

(Respective Links: Thrift, map/reduce views, Thrift, Cursor, Graph, Collection, Nested hashes, get/put, get/put, get/put)

Some highlights:

The columnfamily model shared by Cassandra and HBase is inspired by the one described by Google’s Bigtable paper, section 2. (Cassandra drops historical versions, and addssupercolumns.) In both systems, you have rows and columns like you are used to seeing, but the rows are sparse: each row can have as many or as few columns as desired, and columns do not need to be defined ahead of time.

The Key/value model is the simplest and easiest to implement but inefficient when you are only interested in querying or updating part of a value. It’s also difficult to implement more sophisticated structures on top of distributed key/value.

Document databases are essentially the next level of Key/value, allowing nested values associated with each key. Document databases support querying those more efficiently than simply returning the entire blob each time.

Neo4J has a really unique data model, storing objects and relationships as nodes and edges in a graph. For queries that fit this model (e.g., hierarchical data) they can be1000s of times faster than alternatives.

Scalaris is unique in offering distributed transactions across multiple keys. (Discussing the trade-offs between consistency and availability is beyond the scope of this post, but that is another aspect to keep in mind when evaluating distributed systems.)

Persistence Design

By persistence design I mean, “how is data stored internally?”

The persistence model tells us a lot about what kind of workloads these databases will be good at.

In-memory databases are very, very fast (Redis achieves over 100,000 operations per second on a single machine), but cannot work with data sets that exceed available RAM. Durability (retaining data even if a server crashes or loses power) can also be a problem; the amount of data you can expect to lose between flushes (copying the data to disk) is potentially large. Scalaris, the other in-memory database on our list, tackles the durability problem with replication, but since it does not support multiple data centers your data will be still be vulnerable to things like power failures.

Memtables and SSTables buffer writes in memory (a “memtable”) after writing to an append-only commit log for durability. When enough writes have been accepted, the memtable is sorted and written to disk all at once as a “sstable.” This provides close to in-memory performance since no seeks are involved, while avoiding the durability problems of purely in-memory approaches. (This is described in more detail in sections 5.3 and 5.4 of the previously-referenced Bigtable paper, as well as in The log-structured merge-tree.)

B-Trees have been used in databases since practically the beginning of time. They provide robust indexing support, but performance is poor on rotational disks (which are still by far the most cost-effective) because of the multiple seeks involved in reading or writing anything.

An interesting variant is CouchDB’s append-only B-Trees, which avoids the overhead of seeks at the cost of limiting CouchDB to one write at a time.

Conclusion

The NoSQL movement has exploded in 2009 as an increasing number of businesses wrestle with large data volumes. The Rackspace Cloud is pleased to have played an early role in the NoSQL movement, and continues to commit resources to Cassandra and support events like NoSQL East.

NoSQL conference announcements and related discussion can be found on the Google discussion group.

Source: http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/

Wednesday, November 4, 2009

MySQL University: GRAPH computation engine for MySQL

One of the things relational databases are not very good at is handling structures like trees (for example structure of categories in forum) or graphs (for example network of friends on social networking site).

GRAPH is a MySQL plugin which promises to handle such data in an easy and efficient manner using normal SQL syntax. I don't know any details, as MySQL University announcement was the first time I ever heard of this. Nevertheless the topic seems interesting, so if you would like to know more, attend MySQL Uni session this Thursday, November 5th at 10:00 GMT.

As usual session recording in flash format will also be available at a later time.

More information about GRAPH is available at OpenQuery's site

For more information about MySQL University, go to my introductory post

Source: http://www.phpfreaks.com/blog/mysql-university-graph-computation-engine-for-mysql

Friday, May 29, 2009

MySQL University: Boosting Performance With MySQL 5.1 Partitioning

It's been some time some time since we had a MySQL Uni session subject that could be interesting to an 'average' PHP developer. The one that's going to take place next Thursday will definitely be of this sort.

Introduced in MySQL 5.1, partitioning lets you divide your tables into smaller chunks of data... while still keeping them in one table. No more need for `invoices2006`, `invoices2007`, `invoices2008` tables to cope with archival data, no more wicked UNIONs. Now you can tell MySQL to keep invoices (or whatever you need to keep) from one year separate from all other years, thus improving queries that only deal with records from one year.

That's just a simple example, as there are much more to partitioning than that. I urge you therefore to attend MySQL Uni's Dimdim session this Thursday, June 4th at 13:00GMT.

The session will be presented by Giuseppe Maxia a.k.a. The Data Charmer. I attended one of his sessions in the past, and he seems like a very interesting person to listen to.Check the MySQL University home page for details (and possible last minute schedule changes!)For more information about MySQL University, see my introductory post

Saturday, April 25, 2009

Too much information about the MySQL TIMESTAMP

Source:http://www.gizmola.com/blog/archives/93-Too-much-information-about-the-MySQL-TIMESTAMP.html

The MySQL timestamp is an oddity, being both a mySQL "Data Type" as well as a type of specialty column that provides a built in default. It doesn't help matters, that the timestamp was changed significantly around mysql version 4.1.

The Old TIMESTAMP


In older mysql versions, the TIMESTAMP was not in the same format as a DateTime column, and you could also set up truncation by defining the TIMESTAMP to have a fixed size. For example, you could define a TIMESTAMP column to be a TIMESTAMP(4) which would then only store the 4 digit Year portion of a DateTime value. I won't go into much detail on the pre version 4.1 TIMESTAMP, however, if you're stuck with an older version of MySQL I recommend you read the manual carefully before you attempt to use any of the information here. I'm going to concentrate on the current TIMESTAMP.

TIMESTAMP Properties


At its most fundamental, the TIMESTAMP is really nothing more than a Unix TimeStamp, which is to say, that internally it is stored as an integer value of seconds. Where a MySQL DATETIME column can be used to store any date and time from Jan 1, 1000 to 12/31/9999, the TIMESTAMP is limited in the same ways that the Unix timestamp is currently limited -- it can only store values from Jan 1, 1970 to Jan 9, 2038.

Those familiar with Unix design, will recognize the Jan 9, 2038 date as being the next big "Y2K" computing panic, and if you're young enough, you may realize a large payday in your future, selling remediation services to companies in roughly another 28 years. The folks at http://www.y2038.com/ are already estimating this to be as much as a 10 trillion dollar jackpot, although no doubt by that time most of the coding will be done by the Morlocks from their underground cave cities. Outsourcing of IT to Morlocks will be a major industry trend by the year 2020, mark my words.

Saving bytes


MySQL stores a timestamp as a 32 bit integer, which of course requires 4 bytes of storage. This is one reason why you might want to use a TIMESTAMP over a DATETIME, which requires 8 bytes. Primarily, people look to the TIMESTAMP because, as its name implies, it can be utilized to stamp the time on a row at the point it's inserted. Let's take a look at a mysql TIMESTAMP in action.

First let's create a table with a single TIMESTAMP column in it. We will not refer to this column, and observe what happens when we insert a new row into the table. In case you're keeping score:
mysql> select @@version;
+-----------+
| @@version |
+-----------+
| 5.0.45 |
+-----------+
1 row in set (0.00 sec)


  1. CREATE TABLE atimestamp (id INT PRIMARY KEY AUTO_INCREMENT, justnow DATETIME, created TIMESTAMP);


mysql> describe atimestamp;
+---------+-----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-----------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| justnow | datetime | YES | | NULL | |
| created | timestamp | NO | | CURRENT_TIMESTAMP | |
+---------+-----------+------+-----+-------------------+----------------+
3 rows in set (0.00 sec)


So here we have the table, along with a datetime column that we will set to the magic value of NOW().

mysql> insert into atimestamp (justnow) values (NOW());
Query OK, 1 row affected (0.00 sec)

mysql> select * from atimestamp;
+----+---------------------+---------------------+
| id | justnow | created |
+----+---------------------+---------------------+
| 1 | 2009-04-23 14:29:58 | 2009-04-23 14:29:58 |
+----+---------------------+---------------------+
1 row in set (0.00 sec)


Just as one would hope, MySQL automatically sets the value of TIMESTAMP column I named "created" to be the same as the Server time. So if our primary goal is to have a column that keeps track of when the row was created, our TIMESTAMP does a great job.

Or does it?


Stopping an UPDATE from overwriting the TIMESTAMP



What happens if, at later time, we UPDATE a column in the table?

mysql> update atimestamp set justnow = NOW();
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select * from atimestamp;
+----+---------------------+---------------------+
| id | justnow | created |
+----+---------------------+---------------------+
| 1 | 2009-04-23 14:30:32 | 2009-04-23 14:30:32 |
+----+---------------------+---------------------+
1 row in set (0.00 sec)


This is not good. Our created time has been lost, because the default behavior of a TIMESTAMP is to update the value to NOW(), any time the row is changed.

The TIMESTAMP Defaults


Setting a column to be a MySQL TIMESTAMP is equivalent to also giving the column a default of CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP. For those not familiar with Mysql Defaults, when creating a table, you can specify a default value for a column to receive if it is not specifically assigned a value during an INSERT. You do this using the DEFAULT keyword, and for all columns other than a TIMESTAMP this must be a constant value. Timestamp columns can specify the CURRENT_TIMESTAMP default which tells mySQL to default this column to NOW().

Stopping UPDATE on TIMESTAMP


The only way not to get the update behavior is to specifically declare the TIMESTAMP to DEFAULT CURRENT_TIMESTAMP. This seems to me to defeat the purpose of having the default behavior in the first place but might be understandable if it was possible for you to have a second TIMESTAMP column, perhaps named "updated".

Unfortunately, you can't have one TIMESTAMP with DEFAULT CURRENT_TIMESTAMP, and a second one with ON UPDATE CURRENT_TIMESTAMP. More often than not, what people really want is only the DEFAULT CURRENT_TIMESTAMP behavior, so it's very important to remember this workaround, if you are using a TIMESTAMP as for example, the "signup date" in a User table. See this in action below:

  1. CREATE TABLE atimestamp (id INT PRIMARY KEY AUTO_INCREMENT, justnow DATETIME, created TIMESTAMP DEFAULT CURRENT_TIMESTAMP);


mysql> describe atimestamp;
+---------+-----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-----------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| justnow | datetime | YES | | NULL | |
| created | timestamp | NO | | CURRENT_TIMESTAMP | |
+---------+-----------+------+-----+-------------------+----------------+
3 rows in set (0.00 sec)

mysql> insert into atimestamp (justnow) values (NOW());
Query OK, 1 row affected (0.00 sec)

mysql> select * from atimestamp;
+----+---------------------+---------------------+
| id | justnow | created |
+----+---------------------+---------------------+
| 1 | 2009-04-23 18:15:38 | 2009-04-23 18:15:38 |
+----+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> update atimestamp set justnow = NOW();
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select * from atimestamp;
+----+---------------------+---------------------+
| id | justnow | created |
+----+---------------------+---------------------+
| 1 | 2009-04-23 18:15:56 | 2009-04-23 18:15:38 |
+----+---------------------+---------------------+
1 row in set (0.00 sec)



Now we have automatic timestamping on INSERT, but without the timestamp being overwritten on UPDATE.

What about an UPDATE only TIMESTAMP?


Perhaps there's a really good reason to dedicate a TIMESTAMP to only apply the time when the row is updated. I can't think of one, but for the sake of completeness, here's how you can define the Default.

  1. CREATE TABLE atimestamp (id INT PRIMARY KEY AUTO_INCREMENT, justnow DATETIME, updated TIMESTAMP DEFAULT 0 ON UPDATE CURRENT_TIMESTAMP);


mysql> describe atimestamp;
+---------+-----------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-----------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| justnow | datetime | YES | | NULL | |
| updated | timestamp | NO | | 0000-00-00 00:00:00 | |
+---------+-----------+------+-----+---------------------+----------------+
3 rows in set (0.00 sec)

mysql> insert into atimestamp (justnow) values (NOW());
Query OK, 1 row affected (0.00 sec)

mysql> select * from atimestamp;
+----+---------------------+---------------------+
| id | justnow | updated |
+----+---------------------+---------------------+
| 1 | 2009-04-24 23:33:56 | 0000-00-00 00:00:00 |
+----+---------------------+---------------------+
1 row in set (0.01 sec)

mysql> update atimestamp set justnow = NOW();
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select * from atimestamp;
+----+---------------------+---------------------+
| id | justnow | updated |
+----+---------------------+---------------------+
| 1 | 2009-04-24 23:34:19 | 2009-04-24 23:34:19 |
+----+---------------------+---------------------+
1 row in set (0.00 sec)



Many Timestamps, but only one that's magical


MySQL will allow you to declare multiple columns of type TIMESTAMP, but only the first timestamp in the table will have the built in Default.

  1. CREATE TABLE bigtimestamp (id INT PRIMARY KEY AUTO_INCREMENT, justnow DATETIME, created TIMESTAMP, updated TIMESTAMP, verified TIMESTAMP);


mysql> describe bigtimestamp;
+----------+-----------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+-----------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| justnow | datetime | YES | | NULL | |
| created | timestamp | NO | | CURRENT_TIMESTAMP | |
| updated | timestamp | NO | | 0000-00-00 00:00:00 | |
| verified | timestamp | NO | | 0000-00-00 00:00:00 | |
+----------+-----------+------+-----+---------------------+----------------+
5 rows in set (0.00 sec)

mysql> insert into bigtimestamp (justnow) values (NOW());
Query OK, 1 row affected (0.01 sec)

mysql> select * from bigtimestamp;
+----+---------------------+---------------------+---------------------+---------------------+
| id | justnow | created | updated | verified |
+----+---------------------+---------------------+---------------------+---------------------+
| 1 | 2009-04-24 23:39:19 | 2009-04-24 23:39:19 | 0000-00-00 00:00:00 | 0000-00-00 00:00:00 |
+----+---------------------+---------------------+---------------------+---------------------+
1 row in set (0.00 sec)


Even with this limitation, it may still be a good idea to declare your Date columns as TIMESTAMP type, as the savings of 4 bytes per date per row can easily be significant if your table will have a lot of rows in it.

Single table Insert and Update Timestamp Workaround


It is actually possible to trick mysql into providing the default timestamp behavior for "created' and "updated" columns, despite the documented limitation of only having one default timestamp per table. If you define the first timestamp column to have a default of 0, while also being NOT NULL, you can trick mysql into supplying today's date by explicitly setting the column to be NULL on insert. In this case, mySQL decides to be helpful and for no logical reason, to set the value to NOW(). The second TIMESTAMP which has the explict TIMESTAMP defaults, works normally, and is set to NOW() on insert and update.

  1. CREATE TABLE atimestamp (id INT PRIMARY KEY AUTO_INCREMENT, justnow DATETIME, created TIMESTAMP NOT NULL DEFAULT 0, updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);


mysql> describe atimestamp;
+---------+-----------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-----------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| justnow | datetime | YES | | NULL | |
| created | timestamp | NO | | 0000-00-00 00:00:00 | |
| updated | timestamp | NO | | CURRENT_TIMESTAMP | |
+---------+-----------+------+-----+---------------------+----------------+
4 rows in set (0.00 sec)

mysql> insert into atimestamp (justnow) values (NOW());
Query OK, 1 row affected (0.00 sec)

mysql> select * from atimestamp;
+----+---------------------+---------------------+---------------------+
| id | justnow | created | updated |
+----+---------------------+---------------------+---------------------+
| 1 | 2009-04-24 23:54:23 | 0000-00-00 00:00:00 | 2009-04-24 23:54:23 |
+----+---------------------+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> insert into atimestamp (created, justnow) values (NULL, NOW());
Query OK, 1 row affected (0.00 sec)

mysql> select * from atimestamp;
+----+---------------------+---------------------+---------------------+
| id | justnow | created | updated |
+----+---------------------+---------------------+---------------------+
| 1 | 2009-04-24 23:54:23 | 0000-00-00 00:00:00 | 2009-04-24 23:54:23 |
| 2 | 2009-04-24 23:55:35 | 2009-04-24 23:55:35 | 2009-04-24 23:55:35 |
+----+---------------------+---------------------+---------------------+
2 rows in set (0.00 sec)

mysql> update atimestamp set justnow=NOW() WHERE id = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select * from atimestamp;
+----+---------------------+---------------------+---------------------+
| id | justnow | created | updated |
+----+---------------------+---------------------+---------------------+
| 1 | 2009-04-24 23:54:23 | 0000-00-00 00:00:00 | 2009-04-24 23:54:23 |
| 2 | 2009-04-24 23:56:48 | 2009-04-24 23:55:35 | 2009-04-24 23:56:48 |
+----+---------------------+---------------------+---------------------+
2 rows in set (0.00 sec)


While this technique works at present, I would consider it an undocumented hack, based on a side effect. It also requires you to explicitly specify NULL in the values for the created column. If you forget to reference the column in the INSERT statement, it will not set the value to NOW() as illustrated above. This technique works for now, but who knows what will happen in future versions of mySQL? Use at your own risk.