Rejuvenating the Infrastructure

The lack of agility in the infrastructure, especially the data center and systems programming/administration, is so obvious that the nascent DevOps movement has arisen to combat this perceived deficiency.  Much of that movement’s visible contribution has focused on automation and tools (e.g., Chef, Puppet, cfengine, Nagios) rather than on frameworks or processes.  That is not surprising, given the coding talent of many DevOps advocates.

There are other obstacles:

The data center values stability of the environment and resists change; change is seen as a risk, not as any form of added value.  In general, and perhaps especially in larger IT shops, the data center sees little of the upside of application and environmental changes beyond bug fixes and security patches.  New metrics could provide a more flexible approach, along the lines of measuring increased Business Value (BV).  Some shops, especially those following ITIL, may use a risk metric as part of Change Control, but I have not heard of a corresponding one for enhanced BV.  Such a BV metric could be a 1-to-5 scale, or a set of numeric guidelines.

Much of the infrastructure’s effort is about process, with well-defined solutions.  The challenges are more along the lines of:

–       Sheer scope of tasks, as the whole enterprise may be involved, with the risk that comes with that scale.

–       Interdependencies

–       Complexity

–       Volume of work (especially with sharp peaks)

–       Lack of teams

For volume, automation and tools can help; one should also investigate frameworks like Kanban, which limit the work in progress at any moment.

The lack of teams in infrastructure remains a source of concern for me.  I mean teams behaving as teams, not just silos of individuals with common product-oriented expertise.  For example, a group of system administrators, each with a distinct set of servers to manage and little or no collaboration between members of the group, is not a team.  The situation not only deprives the company and the participants of the benefits of a team approach, but reinforces the sense of isolation and the endless repetition of upgrades and patches.

In my prior blog post on Skill-Centric Teams, a key point is that teams can be created across functional silos, basing them on shared skills and challenges.  While there are distinct differences between a UNIX system administrator, the Windows counterpart, and their mainframe brethren, there is much in common, including, but not limited to:

–       Issues of change management

–       Communication with developers

–       Application development (AD) support methodologies or frameworks

Application development infrastructure support frameworks deserve a separate post.

 

Feeding the Elephant – The Release of Java 7

Java was born of the intelligence and generosity of Sun Microsystems; over the past 15 years, it has evolved, grown in features, and put down roots in an incredibly large part of the technology landscape.  Sun’s generosity also made Java open source (for the most part), furthering its spread.

New releases of Java occurred about every two years through 2006.  Sun was in trouble as a company, and Java languished, with no new release in 2008 or 2009.

Oracle bought Sun, with all of its assets, including Java.  Oracle already had some presence in the Java environment, due to its earlier acquisition of BEA, noted for its WebLogic app server and for its high-performance JVM, JRockit.

With the release of Java 7, the first since 2006, Oracle appears to be positioning Java for renewed life.  Note that Java SE 7 is still not available for the Mac, nor has a projected date for that been announced.

There are only a few noteworthy new features: the new I/O library (NIO.2), with better directory and symbolic link support; features to take advantage of multi-core processors (the Fork/Join framework); and so on.  It’s more of a new foundation for work to come, such as the merger of JRockit with Sun’s HotSpot JVM.
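
To make the new I/O flavor concrete, here is a minimal sketch using the java.nio.file package (the paths are hypothetical): it creates a symbolic link and lists a directory with a glob filter, both conveniences new in Java 7.  The Fork/Join framework is not shown.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Nio2Demo {
    public static void main(String[] args) throws IOException {
        // Paths and Files are part of the new java.nio.file package (NIO.2) in Java 7
        Path dir = Paths.get("/var/log");          // hypothetical directory
        Path link = Paths.get("/tmp/log-link");    // hypothetical symlink location

        // Symbolic link support, new in Java 7 (requires OS/filesystem support)
        if (Files.notExists(link)) {
            Files.createSymbolicLink(link, dir);
        }

        // Improved directory support: iterate entries with a glob filter;
        // try-with-resources is itself a Java 7 language addition
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.log")) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName() + "  " + Files.size(entry) + " bytes");
            }
        }
    }
}
```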

Java benefits from a large base of highly experienced developers, many of them enthusiastic about the language despite the lack of enhancement in recent years.  At the local JavaSIG meeting, several developers expressed concern about Java over the longer term.  For some, the language was not only mature but starting to feel a bit dated.  It’s not clear to me how much of that is valid and objective, and how much is an emotional reaction to Oracle now being in charge of Java’s future.

Oracle has smart product managers and an energetic marketing arm.  They surely recognize that they need to win the confidence and trust of the development community and of those making strategic technology decisions.  This release of Java is a reasonable first step.  The next should be a detailed road map, and timely delivery of the items on it.

The strategic question some ask is, “Will Java offer the performance and features we need to stay competitive in five years?”  The jury waits on Oracle to prove its commitment over the next two years.

Java for Oracle may wind up being like the story of the person who bought an elephant, and then was faced with the cost of caring for and feeding it.

 

Databases, Data Streams, and Quantum Mechanics

What do algorithmic trading, medical systems monitoring, web searches and clicks, and fraud detection have in common?  Part of the answer is streams of data: large amounts of data, from sources that stream it continuously.

Traditional databases (DBMSs) are oriented towards processing and storing transactions, representing discrete events.  The view one has is typically a table.  If transactions accumulate over sufficient time, a data warehouse approach might be used, with an eye towards data mining and analytics, mostly reading (and seldom updating) the stored data for OLAP (online analytic processing) with its classic cube.  In both databases and data warehouses, the tools (e.g., SQL and OLAP tools) are in the hands of skilled business analysts and data analysts, not just programmers.

Classical physics treated light as waves, with different colors having different wavelengths.  Wave behavior allows for interference patterns and for coherence of light, with coherence epitomized by lasers.

Along came Einstein and his contemporaries with fresh insights, revealing that light can also exhibit particle behavior, discrete entities rather than the continuity of waves.  This and far more is explained by quantum physics.

Faster networks in recent years, and faster, larger storage of data, have brought us streaming data.  A stream of data; a continuous wave of it.  New paradigms are needed to fully cope with this, to facilitate new insights.  The tables of relational DBMS and cubes of OLAP only give us a partial view.  Perhaps other geometries or coordinate systems (polar or spherical instead of Cartesian) could provide other insights.

To maximize effectiveness and business value, the new tools for these new insights need to be accessible to the same skilled people, business analysts and data analysts, not just to programmers.

New solutions are evolving, glimmerings of new paradigms; for example, IBM’s BigInsights.  For the curious, look at the major vendors under topics like “data streaming” and “complex event processing”.
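
As a tiny illustration of the shift in mindset, here is a sketch in plain Java (no particular streaming product; the readings are made up) of a continuous query: a sliding-window average recomputed as each event arrives, rather than a query run later against stored rows.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy sliding-window average over a continuous stream of readings. */
public class SlidingWindowAverage {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    public SlidingWindowAverage(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Called once per arriving event; returns the current windowed average. */
    public double onEvent(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();   // oldest reading falls out of the window
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowAverage avg = new SlidingWindowAverage(3);
        double[] readings = {98.6, 99.1, 101.2, 102.5, 97.9};  // e.g., monitored vitals
        for (double r : readings) {
            System.out.printf("reading=%.1f  window avg=%.2f%n", r, avg.onEvent(r));
        }
    }
}
```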

Deep Technology – Data

Deep space refers to a view of space involving great distances, light years: a cosmic perspective.  In recent years, one may hear of deep time: a view of time across eons, a greatly extended perspective.  Along those lines, deep technology examines tech from a similarly extended vista; here, the topic is data, placing it in perspective and sowing seeds for ideas to flower.

We are currently going through a transition in how we view and use data.  Recent decades have seen the ascendance of relational databases and their associated schemas, now used for almost all daily business systems and transactions, with SQL as a standard.  Hardware improvements have allowed relational db’s to grow bigger and perform faster, but some of their limitations have become major friction points as the size of applications has grown, exemplified by social media.  Large once meant gigabytes; now we deal in terabytes, petabytes, and exabytes.

The requirement for a schema consistent across all records becomes a severe constraint with very large databases, where adding another column can mean an outage measured in days.  For that reason, and because of other constraints of very large db’s, comes the assertion that relational doesn’t really scale and that other solutions are needed.

In the beginning there was simply data, not even stored.  Then came stored data and, as it grew, improved means to access it: keys (key-value stores) and indices.

Indexed sequential files grew in popularity, were used in multiple operating systems, and still provide reliable, high-performing solutions today, though their scalability is arguable.  In some cases, they became the underpinnings of a database system.  But any access was programmatic, not user-friendly.  There was no standard language for queries or updates.  Analytic processing was rarely considered in the design; at best it was an add-on.

Database systems rose and evolved: network and hierarchical at first, then the dawn of the relational era, with SQL becoming a standard for queries and access opening up to the non-programmer.

Relational db’s grew larger and faster, and some of the wrinkles in the fabric became more apparent.  Dealing with data other than basic numerics and text was difficult.  Binary data such as images, audio, and video is stored as binary large objects (BLOBs), with no good conventions for access or update, or even provisions for more granular access within the object.  Storing documents as objects presented similar problems.  The idea of a whole document as simply a field (column) within a record (row), with useful access to it, does not fit well within conventional relational approaches.  This has given rise to document-oriented db solutions, from early contenders like Lotus Notes and its contemporaries to recent ones such as MongoDB and CouchDB.  Mongo, by the way, has been reported as part of the technology underlying Twitter.
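
To illustrate the contrast, here is a minimal sketch in plain Java (the field names are hypothetical) of the document idea: a record carried as a nested, schema-free structure rather than as fixed columns, much as a document store like MongoDB or CouchDB would hold it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentSketch {
    public static void main(String[] args) {
        // A "document": nested maps and lists, with no fixed schema shared by every record
        Map<String, Object> customer = new HashMap<>();
        customer.put("name", "Acme");
        customer.put("tier", "gold");

        Map<String, Object> item = new HashMap<>();
        item.put("sku", "X-1");
        item.put("qty", 2);

        List<Map<String, Object>> lineItems = new ArrayList<>();
        lineItems.add(item);

        Map<String, Object> order = new HashMap<>();   // hypothetical field names throughout
        order.put("orderId", "A-1001");
        order.put("customer", customer);
        order.put("lineItems", lineItems);

        // A second document in the same "collection" may carry extra fields:
        // no ALTER TABLE, no outage, just another key.
        Map<String, Object> rushOrder = new HashMap<>(order);
        rushOrder.put("rushDelivery", true);

        System.out.println(order);
        System.out.println(rushOrder);
    }
}
```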

Possibly some email systems’ data stores could be called document-oriented db’s, but their potential for enhanced functionality is seldom explored.

The rise of XML and the need to store such information have furthered both document-oriented db’s and hybrid solutions.  One hybrid approach is to parse the XML into its components, creating a suitable sub-schema in some cases.  Yet this loses the flexibility XML can provide, where not every record need have every field defined.
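
A minimal sketch of that hybrid approach, using the standard Java DOM API (the file and element names are hypothetical): the XML is parsed and selected components are pulled out, as if feeding relational columns, while any fields a given record omits simply are not there.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlShredder {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new File("orders.xml"));   // hypothetical file

        NodeList orders = doc.getElementsByTagName("order");    // hypothetical elements
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            String id = order.getAttribute("id");
            NodeList notes = order.getElementsByTagName("note");
            // Not every record carries every field; a relational sub-schema must decide
            // up front whether "note" is a nullable column, a child table, or dropped.
            String note = notes.getLength() > 0 ? notes.item(0).getTextContent() : null;
            System.out.println(id + " -> " + note);
        }
    }
}
```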

Even with plain numeric fields there have been issues, mostly associated with extended precision, most obvious in the case of decimal values stored as binary floating point and the accuracy of both storage and calculations, with the dreaded rounding errors.  This isn’t just an issue for scientific data; it arises in securities processing, foreign currency, and a surprising number of more mundane cases.
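
A small Java illustration of the rounding hazard: binary floating point cannot represent many decimal fractions exactly, which is why money is usually handled with a decimal type such as BigDecimal.

```java
import java.math.BigDecimal;

public class RoundingDemo {
    public static void main(String[] args) {
        // Binary floating point: 0.10 has no exact representation
        double total = 0.0;
        for (int i = 0; i < 10; i++) {
            total += 0.10;
        }
        System.out.println(total);        // prints 0.9999999999999999, not 1.0

        // Decimal arithmetic: exact for this kind of currency math
        BigDecimal exact = BigDecimal.ZERO;
        for (int i = 0; i < 10; i++) {
            exact = exact.add(new BigDecimal("0.10"));
        }
        System.out.println(exact);        // prints 1.00
    }
}
```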

Geo-spatial data usage, with its precise location capability, has grown greatly in the past few years, perhaps due to increased use of GPS.  However, it seldom has well-integrated functionality in the db, and seems to escape SQL standards.  This type of data is used in marketing, business analysis, and many other use cases; perhaps it would be more widely exploited with improved supporting functionality.

There will be other new data types and usages arising in the years to come, and our data solutions need to provide for integrating them, with the means to create new insights and functionality around them.

The marked increase in the size of data being handled, and new solutions to the associated problems, have given rise to a new movement, NoSQL.  In the past year or so this has morphed from “No SQL” to “Not Only SQL”, seeking to leverage the widespread usage and knowledge of SQL, but in a non-relational context.

In addition to the document-oriented solutions discussed earlier, there are numerous others; the situation is not unlike the days before relational became the norm, when proprietary solutions were common.  There is a lack of standards or consistency between products.  One can note the resurgence of key-value stores, such as BigTable (an ill-chosen name, which has nothing to do with relational tables and does not use SQL).  A brief, very incomplete list of such solutions and associated technologies: CouchDB; MongoDB; BigTable; Cassandra; Hadoop (Apache) and MapReduce (Google); Project Voldemort (underpinning LinkedIn); Riak; Redis; etc.
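
As a taste of the key-value style, here is a minimal sketch using Jedis, a common Java client for recent versions of Redis (the server location and key names are assumptions): values are stored and fetched by key, with no schema and no SQL.

```java
import redis.clients.jedis.Jedis;

public class KeyValueSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);   // assumes a Redis server on localhost
        try {
            // Plain string value under a key -- no schema, no SQL
            jedis.set("session:42", "user=jsmith;expires=1700000000");
            String session = jedis.get("session:42");

            // A hash: a small map of fields stored under one key
            jedis.hset("user:jsmith", "name", "J. Smith");
            jedis.hset("user:jsmith", "tier", "gold");

            System.out.println(session);
            System.out.println(jedis.hgetAll("user:jsmith"));
        } finally {
            jedis.close();
        }
    }
}
```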

Looking towards the future, I can envision improved searching, with enhanced functionality, especially for non-alphanumeric data such as images.  Why not have a query involving an image, or some layered graphics pattern?  Think of a visual version of the Soundex-style “sounds like” functions one can use on text data.

Why should we need to extract the metadata of something in order to inquire about it, instead of letting the query engine do that for us?  For example, for a fashion item, I’d like to be able to query based on the item itself, not on a list of attributes we have created for it.  Copy and paste the item into a search box and let the search find something like it.  It’s like the approach of faceted browsing, but implicit within the search.

We need semantic processing built in, with ontologies either generated or readily linked to reference ontologies.  Data exists in a context, and meaning and insights are more readily generated through those semantic connections and constructs.  As the semantic web and the use of metadata become more widespread, one can hope this will drive better integration of semantics and metadata into data solutions.

Data can be perceived as having a life cycle: the most recent data is of the highest value, needs the fastest access, and therefore sits on the most expensive storage.  Less recent data moves to cheaper storage, perhaps read-only, often in a data warehouse.  Still older data, accessed less often, is placed on even less costly storage, and is finally archived.  Current practice moves data through the elaborate process of Extract, Transform, Load (ETL), usually from a relational db to a data warehouse (DW).  Note that some DWs store data by column rather than by row, since most queries against the data tend to be column (field) oriented.
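
A rough sketch of why column orientation helps analytic queries (the field names are hypothetical): summing one field in the columnar layout touches a single contiguous array, instead of walking every whole row.

```java
public class RowVsColumn {
    // Row-oriented: each record is stored (and read) as a whole
    static class SaleRow {
        int storeId;
        String product;
        double amount;
        SaleRow(int storeId, String product, double amount) {
            this.storeId = storeId; this.product = product; this.amount = amount;
        }
    }

    public static void main(String[] args) {
        SaleRow[] rows = {
            new SaleRow(1, "widget", 9.99),
            new SaleRow(2, "gadget", 24.50),
            new SaleRow(1, "widget", 9.99)
        };
        double rowTotal = 0;
        for (SaleRow r : rows) {           // row layout: every column comes along for the ride
            rowTotal += r.amount;
        }

        // Column-oriented: each field stored as its own array; the query touches only "amount"
        int[] storeId    = {1, 2, 1};
        String[] product = {"widget", "gadget", "widget"};
        double[] amount  = {9.99, 24.50, 9.99};
        double columnTotal = 0;
        for (double a : amount) {
            columnTotal += a;
        }

        System.out.println(rowTotal + "  " + columnTotal);
    }
}
```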

Data analytics and OLAP (online analytic processing, noted for its multi-dimensional analysis) are on the rise of late, fueled in part by major players (e.g., IBM, Oracle) acquiring smaller niche providers in this space, as well as by the need for more information, better risk analysis, and so on.  More priority is now given to questions such as: what information or insights can be established about a customer, a transaction, or a processing location?

Solutions in this space still tend to be add-ons rather than well integrated.  Microsoft SQL Server and its analytic abilities have been a pleasant exception to the typical relational solution.

One is also cautioned against burdening a high-performing, transaction-oriented db with the potentially crippling overhead of OLAP banging against large amounts of the data.  In some cases, the Big Data solutions of Hadoop, MapReduce, and Cassandra are used against data stores outside the relational db, going against huge log files and data files outside any db, with data sharding and massively parallel processing (hundreds or thousands of servers) providing answers in astonishingly short turn-around times.  But with this, we’re mostly back to solutions dependent upon programming instead of built-in functionality.
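
To show what “back to programming” means in practice, here is a minimal Hadoop MapReduce sketch in Java (the input paths and the choice of key are assumptions): counting occurrences per key across huge log files, the sort of job that would be a one-line GROUP BY against a db.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCount {
    // Map: emit (key, 1) for the first field of each log line (assumed to be the key of interest)
    public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                outKey.set(fields[0]);          // e.g., client IP or page id (assumption)
                ctx.write(outKey, ONE);
            }
        }
    }

    // Reduce: sum the counts for each key
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log count");
        job.setJarByClass(LogCount.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/logs/raw"));       // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/logs/counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```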

The ongoing pattern is to empower the end user, the data analyst, with built-in tools and standard languages.

Why should we need external tools such as Google Refine (originally Freebase Gridworks) for data cleansing? Cannot this be integrated into SQL?

Why not include common output functionality, such as dashboards, into the db systems?

I remain curious about what new solutions will arise, the challenges yet to be solved, and the problems not yet formed.