Deep Technology – Data

Deep space refers to a view of space involving great distances, measured in light years: a cosmic perspective.  In recent years, one may hear of deep time: a view of time across eons, a greatly extended perspective.  Along those lines, deep technology examines tech from a similarly extended vista; here, the topic is data, placed in perspective, with seeds sown for ideas to flower.

We are currently going through a time of transition in how we view and use data.  Recent decades have seen the ascendance of relational databases and their associated schemas, currently used for almost all daily business systems and transactions, with SQL as a standard.  Hardware improvements have allowed relational db’s to grow bigger and perform faster, but some of their limitations have become major friction points as the size of applications has grown, exemplified by social media.  Large once meant gigabytes; now we deal in terabytes, petabytes and exabytes.

The current requirement for a schema consistent across all records becomes a severe constraint with very large databases, where adding another column can require an outage of days.  For that reason, and because of other constraints in the case of very large db’s, comes the assertion that relational doesn’t really scale, and that other solutions are needed.

In the beginning there was simply data, not even stored.  Then came stored data, and as it grew, improved means to access it, using keys (key-value stores), and indices.
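
As a minimal sketch of the key-value idea, here is Python's built-in dbm module used as a tiny store; the key scheme and record contents are purely illustrative:

```python
import dbm

# A minimal key-value store using Python's built-in dbm module (creates a small
# file on disk): each record is located directly by its key, with no schema and
# no query language.
with dbm.open("customers", "c") as store:          # "c" = create if absent
    store["cust:1001"] = '{"name": "Ada", "city": "London"}'   # value is an opaque blob
    store["cust:1002"] = '{"name": "Grace", "city": "Arlington"}'

    raw = store["cust:1001"]          # retrieval is by exact key only
    print(raw.decode("utf-8"))        # the store neither knows nor cares what is inside
```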

Indexed sequential files grew in popularity, were used in multiple operating systems, and still provide reliable, high-performing solutions today; their scalability is arguable.  In some cases, they became the underpinnings of a database system.  But any access was programmatic, not user friendly.  There was no standard language for queries or updates.  Analytic processing was rarely considered in the design; at best, it was an add-on.

Database systems rose and evolved, network and hierarchical at first, followed by the dawn of the relational era, with SQL becoming a standard for queries and access becoming available to the non-programmer.
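
A small illustration of what that shift meant, using Python's bundled sqlite3 module (table and data invented for the example): the query states what is wanted, with no navigation of pointers or records required of the user:

```python
import sqlite3

# A declarative SQL query: the user states *what* is wanted, not *how* to fetch it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Acme", 120.0), (2, "Globex", 75.5), (3, "Acme", 40.0)])

# Total per customer, expressed as a statement of intent rather than a traversal.
for customer, total in conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
    print(customer, total)
```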

Relational db’s grew larger and faster, and some of the wrinkles in the fabric became more apparent.  Dealing with data other than basic numeric and text was difficult.  Various types of binary data such as images, audio and video are stored as binary large objects (BLOBs), with no good conventions for access or update, or even provisions for more granular access within the object.  Storage of documents as objects presented similar problems.  The idea of treating a whole document as simply a field (column) within a record (row), while still having useful access to its contents, does not fit well within conventional relational approaches.  This has given rise to document-oriented db solutions, from early contenders such as Lotus Notes and its contemporaries, to recent ones such as MongoDB and CouchDB.  Mongo, by the way, is part of the technology underlying Twitter.
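
As a rough sketch of the document-oriented style, here is MongoDB via the pymongo driver (this assumes a local MongoDB server; the database, collection and field names are invented for illustration):

```python
from pymongo import MongoClient   # assumes the pymongo driver and a local MongoDB server

client = MongoClient("mongodb://localhost:27017/")
articles = client["demo"]["articles"]             # database and collection names are illustrative

# Each document carries its own structure; two documents need not share fields.
articles.insert_one({"title": "Q3 report", "body": "...", "tags": ["finance", "quarterly"]})
articles.insert_one({"title": "Site photo", "image_ref": "img/042.png"})

# Query on a field inside the document, something a BLOB column cannot offer.
for doc in articles.find({"tags": "finance"}):
    print(doc["title"])
```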

Possibly some of the storage solutions of email systems could be called document-oriented db’s, but the potential for exploring enhanced functionality is seldom realized.

The rise of XML and the need to store such information has furthered both document-oriented db’s and hybrid solutions.  One hybrid approach is to parse the XML into its components, creating a suitable sub-schema in some cases.  Yet this loses the flexibility XML can provide, where not every record need have all fields defined.
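
A small sketch of that hybrid "shredding" approach, using Python's standard xml.etree module on an invented order record; note how the optional element forces a schema decision:

```python
import xml.etree.ElementTree as ET

# Hypothetical order record; real feeds will differ.
xml_record = """
<order id="42">
  <customer>Acme</customer>
  <item sku="A-7" qty="3"/>
  <note>Rush delivery</note>   <!-- optional: not every order carries a note -->
</order>
"""

root = ET.fromstring(xml_record)

# "Shredding" the document into flat columns for a relational sub-schema...
row = {
    "order_id": root.get("id"),
    "customer": root.findtext("customer"),
    "sku": root.find("item").get("sku"),
    "qty": int(root.find("item").get("qty")),
    "note": root.findtext("note"),   # ...forces a decision for fields that may be absent
}
print(row)
```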

Even just with numeric fields there have been issues, mostly associated with extended precision, most obvious in the case of decimal values stored as floating point, where the accuracy of both storage and calculations suffers from dreaded rounding errors.  This isn’t just an issue for scientific data; it arises in securities processing, foreign currency, and a surprising number of more mundane cases.
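
A quick Python illustration of the problem and the usual remedy (exact decimal arithmetic):

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
print(0.10 + 0.20)                    # 0.30000000000000004
print(0.10 + 0.20 == 0.30)            # False: the dreaded rounding error

# Exact decimal arithmetic, as needed for money, securities, FX rates.
print(Decimal("0.10") + Decimal("0.20"))                     # 0.30
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```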

Geo-spatial data usage, with its precise location capability, has grown greatly in the past few years, perhaps due to increased use of GPS.  However, it seldom has well-integrated functionality in the db, and seems to escape SQL standards.  This type of data is used in marketing, business analysis and many other use cases; perhaps it would be more widely exploited with improved supporting functionality.
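
As an example of the kind of functionality one would like built into the db rather than coded by hand, here is a haversine great-circle distance in plain Python; the coordinates are just sample points:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))   # 6371 km: mean Earth radius

# "Stores within delivery range of a customer" is the kind of predicate one would
# like the db engine to evaluate natively rather than in application code.
print(haversine_km(40.7128, -74.0060, 40.7580, -73.9855))
```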

There will be other new data types and usages arising in the years to come, and our data solutions need to provide for integrating them, with the means to create new insights from them and to build functionality around them.

The marked increase in the size of data being handled and new solutions to associated problems has given rise to a new movement, NoSQL.  In the past year or so this has morphed from No SQL to Not Only SQL, seeking to leverage the widespread usage and knowledge of SQL, but in a non-relational context.

In addition to the document-oriented solutions discussed earlier, there are numerous others, and, not unlike the days before relational became the norm, when proprietary solutions were common, there is a lack of standards or consistency between products.  One can note the resurgence of key-value stores, such as Bigtable (an ill-chosen name, which has nothing to do with relational tables and does not use SQL).  A brief, very incomplete list of such solutions and associated technologies: CouchDB; MongoDB; Bigtable; Cassandra; Hadoop (Apache) and MapReduce (Google); Project Voldemort (underpinning LinkedIn); Riak; Redis; etc.

Looking towards the future, I can envision improved searching, with enhanced functionality, especially for non-alphameric data such as images.  Why not have a query involving an image, or some layered graphics pattern?  Think of a visual version of the SoundsLike function one can use on text data.
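
For reference, the text-side precedent is usually a Soundex-style phonetic code; a simplified sketch (the full rules have more edge cases) gives a feel for what a visual analogue would have to accomplish:

```python
def soundex(word: str) -> str:
    """A simplified Soundex sketch: words that sound alike map to the same code."""
    codes = {ch: d for d, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for ch in letters}
    word = word.upper()
    result, prev = word[0], codes.get(word[0], 0)
    for ch in word[1:]:
        code = codes.get(ch)
        if code and code != prev:                  # skip vowels, collapse repeats
            result += str(code)
        prev = code if code else (prev if ch in "HW" else 0)
        if len(result) == 4:
            break
    return result.ljust(4, "0")

print(soundex("Robert"), soundex("Rupert"))   # both R163: a match on sound, not spelling
```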

Why should we need to do the metadata extraction of something ourselves in order to inquire about it, instead of letting the query engine do that for us?  For example, for a fashion item, I’d like to be able to query based upon the item itself, not a list we have created of its attributes.  Copy and paste the item into a search box and let the search find something like it.  This is like the approach of faceted browsing, but implicit within the search.

We need semantic processing built in, with ontologies either generated or reference ontologies readily linked.  Data exists in a context, with meaning and insights more readily generated through those semantic connections and constructs.  As the semantic web and the use of metadata become more widespread, one can hope this will drive better integration of semantics and metadata into data solutions.
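
A toy illustration of what those semantic connections look like as subject-predicate-object triples; a real system would use an RDF store and shared reference ontologies rather than in-memory tuples:

```python
# Invented triples linking an order to context that no single table holds.
triples = [
    ("order:42", "placedBy", "customer:acme"),
    ("customer:acme", "locatedIn", "city:london"),
    ("city:london", "partOf", "country:uk"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# A question the raw order record alone cannot answer: which country did this order come from?
customer = match(s="order:42", p="placedBy")[0][2]
city = match(s=customer, p="locatedIn")[0][2]
print(match(s=city, p="partOf")[0][2])     # country:uk
```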

Data can be perceived as having a life cycle: the most recent data is of highest value, needs the fastest access of its life, and therefore sits on the most expensive storage.  Less recent data moves to cheaper storage, perhaps read-only; often this might be in a data warehouse.  Still older data, accessed less often, is placed on even less costly storage, and is finally archived.  Current practice has this going through the elaborate process of Extract, Transform, Load (ETL), usually going from a relational db to a data warehouse (DW).  Note that some DW’s store data by column rather than by row, since most queries against the data tend to be column (field) oriented.
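
A small in-memory illustration of the row versus column layouts, with invented sales records; a warehouse-style query that sums one field only needs to touch one column:

```python
# Row-oriented vs. column-oriented layout of the same data, illustrated in memory.
rows = [
    {"date": "2011-06-01", "region": "EU", "sales": 120.0},
    {"date": "2011-06-01", "region": "US", "sales": 340.0},
    {"date": "2011-06-02", "region": "EU", "sales": 95.0},
]

# Columnar layout: all values of one field stored together, which is what a
# "sum of sales" style warehouse query actually needs to scan.
columns = {field: [r[field] for r in rows] for field in rows[0]}

print(sum(columns["sales"]))          # scans a single column
print(sum(r["sales"] for r in rows))  # same answer, but conceptually touches every whole row
```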

Data analytics and OLAP (online analytic processing, noted for its multi-dimensional analysis) are on the rise of late, fueled in part by the acquisition of smaller niche providers of this functionality by major players (e.g., IBM, Oracle), as well as by the need for more information, better risk analysis, etc.  Nowadays more priority is given to questions such as: what information or insights can be established about a customer, a transaction, or a processing location?
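
A minimal sketch of the multi-dimensional idea: rolling one measure up along two dimensions, with made-up facts:

```python
from collections import defaultdict

# Facts with two dimensions (region, product) and one measure (amount).
facts = [
    ("EU", "widgets", 100), ("EU", "gadgets", 40),
    ("US", "widgets", 250), ("US", "gadgets", 90),
]

# Roll the measure up along both dimensions: the core of multi-dimensional analysis.
cube = defaultdict(float)
for region, product, amount in facts:
    cube[(region, product)] += amount      # finest grain
    cube[(region, "*")] += amount          # roll-up over product
    cube[("*", product)] += amount         # roll-up over region
    cube[("*", "*")] += amount             # grand total

print(cube[("US", "*")], cube[("*", "widgets")], cube[("*", "*")])
```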

Solutions for this space still tend to be add-ons, rather than well integrated.  Microsoft SQL Server and its analytic abilities have been a pleasant exception to the typical relational solution.

One is also cautioned against burdening a high-performing, transaction-oriented db with the potentially crippling overhead of OLAP banging against large amounts of the data.  In some cases, the Big Data solutions of Hadoop, MapReduce and Cassandra are used against data stores outside the relational db, going against huge log files and data files that sit outside any db, with data sharding and massively parallel processing (hundreds or thousands of servers) providing answers in astonishingly short turn-around.  But with this, we’re mostly back to solutions dependent upon programming instead of built-in functionality.
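
A single-process sketch of the MapReduce pattern over invented log lines; in Hadoop the map and reduce phases would run sharded across many machines:

```python
from collections import Counter
from itertools import chain

# Toy web-server log; real jobs run over files far too large for one machine.
log_lines = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /search 200",
    "10.0.0.1 GET /home 500",
]

def map_phase(line):
    """Emit (key, 1) pairs, here one per requested URL."""
    return [(line.split()[2], 1)]

def reduce_phase(pairs):
    """Sum the counts for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

print(reduce_phase(chain.from_iterable(map_phase(line) for line in log_lines)))
# Counter({'/home': 2, '/search': 1})
```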

The ongoing pattern is to empower the end user, the data analyst, with built-in tools and standard languages.

Why should we need external tools such as Google Refine (originally Freebase Gridworks) for data cleansing?  Could this not be integrated into SQL?
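
A hint of what cleansing inside SQL could look like today, using string functions SQLite already provides (TRIM, LOWER, REPLACE) via Python's sqlite3; the richer clustering and faceting of Google Refine would need deeper engine support:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?)",
                 [("  Alice@Example.COM ",), ("bob@example .com",)])

# Normalize case and strip stray whitespace entirely within SQL.
conn.execute("UPDATE contacts SET email = REPLACE(LOWER(TRIM(email)), ' ', '')")
print([row[0] for row in conn.execute("SELECT email FROM contacts")])
# ['alice@example.com', 'bob@example.com']
```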

Why not include common output functionality, such as dashboards, into the db systems?

I remain curious as to what new solutions will arise, and the challenges yet to be solved, problems not yet formed.
