The Rise of I.T. as a Service Broker

Just a quick blog post today in the run-up to Christmas week. I thought I’d briefly summarize some of the things I have been dealing with recently, and also touch on the role of the I.T. department as we move boldly into a cloudy world.

We have seen I.T. move through the virtualization phase to deliver greater efficiency and some have moved on to the Cloud phase to deliver more automation, elasticity and metering. Cloud can be private, public, community or hybrid, so Cloud does not necessarily imply an external service provider.

Iterative Right-Sizing

One of the things that has become clear is the need for right-sizing as part of any move to an external provider. External provision has a low base cost and a high metered cost, so you get the best value by making sure your allowances for CPU, RAM and disk are a reasonably tight fit with your actual requirements, and by relying on service elasticity to expand as needed. The traditional approach of building a lot of advance headroom into everything will cost you dearly. You cannot expect an external provider to deliver “your mess for less”; if you don’t right-size, what you will get is “your mess for more”.
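
To make the point concrete, here is a minimal sketch of the arithmetic, assuming a purely illustrative rate card (none of these prices or sizes belong to any real provider):

    # Toy monthly cost model for an externally provided VM: a small base charge
    # plus metered charges for the vCPU, RAM and disk you actually reserve.
    # All rates and sizes are invented for illustration.

    BASE = 20.0          # base charge per VM per month
    VCPU_RATE = 25.0     # per vCPU per month
    RAM_RATE = 10.0      # per GB of RAM per month
    DISK_RATE = 0.15     # per GB of disk per month

    def monthly_cost(vcpus, ram_gb, disk_gb):
        return BASE + vcpus * VCPU_RATE + ram_gb * RAM_RATE + disk_gb * DISK_RATE

    # A workload whose measured requirement is 2 vCPU / 8 GB / 200 GB
    right_sized = monthly_cost(vcpus=2, ram_gb=8, disk_gb=200)

    # The same workload sized the traditional way, with generous advance headroom
    over_provisioned = monthly_cost(vcpus=8, ram_gb=32, disk_gb=1000)

    print(f"Right-sized:      ${right_sized:,.2f}/month")       # $180.00
    print(f"Over-provisioned: ${over_provisioned:,.2f}/month")  # $690.00

On these made-up numbers the built-in headroom costs nearly four times as much every month, which is exactly the “your mess for more” outcome, and with elasticity available there is little reason to pay for it up front.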

Right-Locating

And it’s not necessarily true that all of your services are best met by the one or two tiers that a single Cloud provider offers. This is where the Hybrid Cloud comes in, and more than that, this is where a Cloud Management Platform (CMP) function comes in.

“Any substantive cloud strategy will ultimately require using multiple cloud services from different providers, and a combination of both internal and external cloud.” Gartner, September 2013 (Hybrid Cloud Is Driving the Shift From Control to Coordination).

A CMP such as VMware’s vRealize Automation, RightScale, or Scalr can actually take you one step further than a simple Hybrid Cloud. A CMP can allow you to right-locate your services in a policy-driven and centrally managed way. This might mean keeping some services in-house, some in an enterprise I.T. focused Cloud with a high level of performance and wrap-around services, and some in a race-to-the-bottom Public Cloud focused primarily on price.
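
As a rough illustration of what policy-driven right-location can look like underneath, here is a minimal sketch of a placement rule. The venue names, service attributes and rules are all invented for the example; a real CMP such as vRealize Automation, RightScale or Scalr would express this through its own blueprints and governance policies.

    # Toy right-location policy: choose a venue for each service based on
    # sovereignty, service tier and performance needs. Everything here is hypothetical.

    def place_service(svc):
        if svc["sovereignty"] == "onshore-only":
            # Data that must not leave the country stays in-house.
            return "in-house private cloud"
        if svc["tier"] == "production" and svc["needs_high_performance"]:
            # Demanding production workloads go to an enterprise-grade cloud
            # with wrap-around services.
            return "enterprise I.T. focused cloud"
        # Low-value test and dev lands on the cheapest venue.
        return "commodity public cloud"

    services = [
        {"name": "payroll DB",   "tier": "production", "sovereignty": "onshore-only", "needs_high_performance": True},
        {"name": "web frontend", "tier": "production", "sovereignty": "any",          "needs_high_performance": True},
        {"name": "UAT sandbox",  "tier": "test",       "sovereignty": "any",          "needs_high_performance": False},
    ]

    for svc in services:
        print(f"{svc['name']:12} -> {place_service(svc)}")

The value is not in the ten lines of logic but in having the policy defined once, centrally, and applied to every provisioning request.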

[Figure: 5-year TCO comparison]

Some organisations are indeed consuming multiple services from multiple providers, but very few are managing this in a co-ordinated, policy-driven manner. The kinds of problems that can arise are:

  • Offshore Public Cloud instances may be started up for temporary use and then forgotten rather than turned off, incurring unnecessary cost.
  • Important SQL database services might be running on a low cost IaaS with database administration duties neglected, creating unnecessary risk.
  • Low value test systems might be running on a high-service, high-performance enterprise cloud service, incurring unnecessary cost.

I.T. as a Service Broker

This layer of policy and management has a natural home with the I.T. department, but as an enabler for enterprise-wide in-policy consumption rather than as an obstacle.

[Figure: I.T. as a Service Broker]

With the Service Brokering Capability, I.T. becomes the central point of control, provision, self-service and integration for all IT services regardless of whether they are sourced internally or externally. This allows an organisation to mitigate the risks and take the opportunities associated with Cloud.


I will be enjoying the Christmas break and extending that well into January as is traditional in this part of the world where Christmas coincides with the start of Summer.

Happy holidays to all.

What is Cloud Computing?

I remember being entertained by Larry Ellison’s Cloud Computing rant back in 2009, in which he pointed out that cloud was really just processors and memory and operating systems and databases and storage and the internet. Larry was making a valid point, and he also made a point about IT being a fashion-driven industry, but the positive goals of Cloud Computing should by now be much clearer to everyone.

When we talk about Cloud Computing it’s probably important that we try to work from a common understanding of what Cloud is and what the terms mean, and that’s where NIST comes in.

The National Institute of Standards and Technology (NIST) is an agency of the US Department of Commerce. In 2011, two years after Larry Ellison’s outburst, and after many drafts and years of research and discussion, NIST published their ‘Cloud Computing Definition’ stating:

“The definition is intended to serve as a means for broad comparisons of cloud services and deployment strategies, and to provide a baseline for discussion from what is cloud computing to how to best use cloud computing”.

“When agencies or companies use this definition they have a tool to determine the extent to which the information technology implementations they are considering meet the cloud characteristics and models. This is important because by adopting an authentic cloud, they are more likely to reap the promised benefits of cloud—cost savings, energy savings, rapid deployment and customer empowerment.”

The definition lists the five essential characteristics, the three service models and the four deployment models. I have summarized them in this blog post so as to do my small bit in encouraging the adoption of this definition as widely as possible, to give us a common language and measuring stick for assessing the value of Cloud Computing.

The Five Essential Characteristics

  1. On-demand self-service.
    • A consumer can unilaterally provision computing capabilities without requiring human interaction with the service provider.
  2. Broad network access.
    • Support for a variety of client platforms including mobile phones, tablets, laptops, and workstations.
  3. Resource pooling.
    • The provider’s computing resources are pooled under a multi-tenant model, with physical and virtual resources dynamically assigned according to demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
  4. Rapid elasticity.
    • Capabilities can be elastically provisioned and released commensurate with demand. Scaling is rapid and can appear to be unlimited.
  5. Metering.
    • Service usage (e.g., storage, processing, bandwidth, active user accounts) can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the service.

The Three Service Models

  1. Software as a Service (SaaS).
    • The consumer uses the provider’s applications, accessible from client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
  2. Platform as a Service (PaaS).
    • The consumer deploys consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.
  3. Infrastructure as a Service (IaaS).
    • Provisioning of processing, storage, networks, etc., where the consumer can run a range of operating systems and applications. The consumer does not manage the underlying infrastructure but has control over operating systems, storage, and deployed applications, and possibly limited control of networking (e.g., host firewalls).

Note that NIST has resisted the urge to go on to define additional services such as Backup as a Service (BaaS), Desktop as a Service (DaaS), Disaster Recovery as a Service (DRaaS) etc., arguing that these are already covered in one way or another by the three ‘standard’ service models. This does lead to an interesting situation where one vendor will offer DRaaS or BaaS effectively as an IaaS offering, and another will offer it under more of a SaaS or PaaS model.

The Four Deployment Models

  1. Private cloud.
    • The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.
  2. Community cloud.
    • The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.
  3. Public cloud.
    • The cloud infrastructure is provisioned for open use by the general public. It exists on the premises of the cloud provider.
  4. Hybrid cloud.
    • The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are connected to enable data and application portability (e.g., cloud bursting for load balancing between clouds).

The NIST reference architecture also talks about the importance of the brokering function, which allows you to seamlessly deploy across a range of internal and external resources according to the policies you have set (e.g. cost, performance, sovereignty, security).

The NIST definition of Cloud Computing is the one adopted by ViFX and it is the simplest, clearest and best-researched definition of Cloud Computing I have come across.

2014 Update

On 22nd October 2014 NIST published a new two-volume document, “US Government Cloud Computing Technology Roadmap”, which identifies ten high-priority requirements for Cloud Computing adoption across the five areas of:

  • Security
  • Interoperability
  • Portability
  • Performance
  • Accessibility

The purpose of the document is to provide a cloud roadmap for US Government agencies, highlighting ten high-priority requirements that must be met if the benefits of cloud computing are to be realized. Requirements seven and eight are particular to the US Government, but the others are generally applicable. My interpretation of NIST’s ten requirements is as follows:

  1. Standards-based products, processes, and services are essential to ensure that:
    • Technology investments do not become prematurely obsolete
    • Agencies can easily change cloud service providers
    • Agencies can economically acquire or develop private clouds
  2. Security technology solutions must be able to accommodate a wide range of business rules.
  3. Service-Level Agreements for performance and reliability should be clearly defined and enforceable.
  4. Consistent service descriptions across vendors are required to make it easier for agencies to compare apples to apples.
  5. Federation in a community cloud environment needs more mature mechanisms to enable mutual sharing of resources.
  6. Data location and sovereignty policies are required so as to avoid technology limits becoming the de facto drivers of policy.
  7. US Federal Government requires special solutions that are not currently available from commercial cloud services.
  8. US Federal Government requires nation-scale non-proprietary technology including high security and emergency systems.
  9. High availability design goals, best practices, measurement and reporting are required to avoid catastrophic failures.
  10. Metrics need to be standardized so services can be sized and consumed with a high degree of predictability.

These are all worthwhile requirements, and there’s also a loopback here to some of Larry Ellison’s comments. Larry spoke about seeing value in rental arrangements, but also touched on the importance of innovation. NIST is trying to standardize and level the playing field to maximize value for consumers, but history tells us that vendors will try to innovate to differentiate themselves. For example, with the launch of VMware’s vCloud Air we are seeing the dominant player in infrastructure management software today staking its claim to establish itself as the de facto software standard for hybrid cloud. But that is really a topic for another day…


Storage Spaghetti Anyone?

I recall Tom West (Chief Scientist at Data General, and the central figure of The Soul of a New Machine) telling me, when he visited New Zealand, that there was an old saying: “Hardware lasts three years, Operating Systems last 20 years, but applications can go on forever.”

Over the years I have known many application developers and several development managers, and one thing they seem to agree on is that it is almost impossible to maintain good code structure inside an app over a period of many years. Deadline pressure for features, changes in the market, in fashion and in the way people use applications, the occasional weak programmer or weak dev manager, and temporary lapses in discipline under other pressures all contribute to fragmentation over time. It is generally by this slow attrition that apps end up full of structural compromises, with the occasional corner that is complete spaghetti.

I am sure there are exceptions, and there can be periodic rebuilds that improve things, but rebuilds are expensive.

If I think about the OS layer, I recall Data General rebuilding much of their DG/UX UNIX kernel to make it more structured, because they considered the System V code to be pretty loose. Similarly, IBM rebuilt UNIX into a more structured AIX kernel around the same time, and Digital UNIX (OSF/1) was also a rebuild, based on Mach. Ironically, HP-UX eventually won out over Digital UNIX after the merger, with HP-UX rumoured to be the much less structured product, a choice that I’m told has slowed a lot of ongoing development. Microsoft rebuilt Windows as NT, and Apple rebuilt Mac OS to base it on the Mach kernel.

So where am I heading with this?

Well, I have discussed this topic with a couple of people in recent times in relation to storage operating systems. If I line up some storage OSes and their approximate dates of original release, you’ll see what I mean:

NetApp Data ONTAP: 1992 (22 years)
EMC VNX / CLARiiON: 1993 (21 years)
IBM DS8000 (assuming ESS code base): 1999 (15 years)
HP 3PAR: 2002 (12 years)
IBM Storwize: 2003 (11 years)
IBM XIV / Nextra: 2006 (8 years)
Nimble Storage: 2010 (4 years)

I’m not trying to suggest that this is a line-up in reverse order of quality, and no doubt some vendors might claim rebuilds or superb structural discipline, but knowing what I know about software development, the age of the original code is certainly a point of interest.

With the current market disruption in storage, cost pressures are bound to take their toll on development quality, and the problem is amplified if vendors try to save money by out-sourcing development to non-integrated teams in low-cost countries (e.g. build your GUI in Romania, or your iSCSI module in India).


Decoupling Storage Performance from Capacity

Decoupling storage performance from storage capacity is an interesting concept that has gained extra attention in recent times. Decoupling is predicated on a desire to scale performance when you need performance and to scale capacity when you need capacity, rather than traditional spindle-based scaling delivering both performance and capacity.

Also relevant is the idea that today’s legacy disk systems are holding back app performance. For example, VMware apparently claimed that 70% of all app performance support calls were caused by external disk systems.

The Business Value of Storage Performance

IT operations have spent the last 10 years trying to keep up with capacity growth, with less focus on performance growth. The advent of flash has shown, however, that even if you don’t have a pressing storage performance problem, adding flash will generally make your whole app environment run faster, and that can mean business advantages ranging from better customer experiences to more accurate business decision making.

A Better Customer Experience

My favorite example of performance affecting customer experience is from my past dealings with an ISP of which I was a residential customer. I was talking to a call centre operator who explained to me that ‘the computer was slow’ and that it would take a while to pull up the information I was seeking. We chatted as he slowly navigated the system, and as we waited, one of the things he was keen to chat about was how much he disliked working for that ISP.

I have previously referenced a mobile phone company in the US that replaced all of its call centre storage with flash, specifically so as to deliver a better customer experience. The challenge with that is cost. The CIO was quoted as saying that the cost to go all-flash was not much more per TB than he had paid for tier-1 storage in the previous buying cycle (i.e. three or maybe five years earlier). So effectively he was conceding that he was paying more per TB for tier-1 storage now than he was some years ago. Because the environment deployed did not decouple performance from capacity, however, that company has almost certainly over-provisioned storage performance significantly, hence the cost per TB being higher than in the last buying cycle.

More Accurate Business Decision Making

There are many examples of storage performance improvements leading to better business decisions, most typically in the area of data warehousing. When business intelligence reports contain more up-to-date data, and they run more quickly, they are used more often, and decisions are more likely to be evidence-based rather than based on intuition. I recall one CIO telling me about a meeting of his company’s executive leadership team some years ago where each exec was asked to write down the name of the company’s largest supplier, and each wrote a different name, illustrating the risk of making decisions based on intuition rather than on evidence and business intelligence.

Decoupling Old School Style

Of course we have always been able to decouple performance and capacity to some extent, and it was traditionally called tiering. You could run your databases on small, fast drives in RAID 10 and your less demanding storage on larger drives with RAID 5 or RAID 6. What that didn’t necessarily give you was a lot of flexibility.

Products like IBM’s SAN Volume Controller introduced the flexibility to move volumes around between tiers in real time, and more recently VMware’s Storage vMotion has provided a subset of the same functionality.

Then sub-LUN tiering (Automatic Data Relocation, Easy Tier, FAST, etc.) reduced the need for volume migration as a means of managing performance, by automatically promoting hot chunks to flash and demoting cooler chunks to slower disks. You could decouple performance from capacity somewhat by choosing your flash-to-disk ratio appropriately, but you still typically had to be careful with these solutions, since the performance of, for example, random writes that do not go to flash remains heavily dependent on the disk spindle count and speed.
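
The mechanics behind these features are broadly similar: track how hot each chunk is over a period, periodically promote the hottest chunks into flash until the flash tier is full, and demote the rest. The sketch below is a generic illustration of that idea, not the actual algorithm used by Easy Tier, FAST or any other product.

    # Generic sub-LUN tiering sketch: rank chunks by recent I/O count ("heat")
    # and place the hottest ones on flash until the flash tier is full.
    # Chunk names, heat values and tier sizes are made up.

    def retier(chunk_heat, flash_capacity_chunks):
        """chunk_heat: dict mapping chunk id -> I/O count over the last interval."""
        ranked = sorted(chunk_heat, key=chunk_heat.get, reverse=True)
        flash = ranked[:flash_capacity_chunks]      # promote the hottest chunks
        disk = ranked[flash_capacity_chunks:]       # everything else sits on spinning disk
        return flash, disk

    heat = {"chunk-A": 9100, "chunk-B": 120, "chunk-C": 4300, "chunk-D": 15, "chunk-E": 2800}
    flash_tier, disk_tier = retier(heat, flash_capacity_chunks=2)
    print("flash:", flash_tier)   # ['chunk-A', 'chunk-C']
    print("disk: ", disk_tier)    # ['chunk-E', 'chunk-B', 'chunk-D']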

So for the most part, decoupling storage performance and capacity in an existing disk system has been about adding flash and trying not to hit internal bottlenecks.

Traditional random I/O performance is therefore a function of the following (roughly modelled in the sketch after this list):

  1. the amount or percentage of flash relative to the data block working set size
  2. the number and speed of disk spindles
  3. bus and cache (and sometimes CPU) limitations
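
A very rough back-of-the-envelope model of those three factors might look like the sketch below. The per-device IOPS figures are ballpark illustrations only, and in practice bus, cache and CPU ceilings will cap the result further.

    # Crude random-read IOPS estimate for a hybrid array: reads that hit flash
    # are served by the flash devices, misses are served by the spindles, and
    # overall throughput is capped by whichever path saturates first.
    # Device IOPS numbers below are illustrative, not vendor specifications.

    FLASH_IOPS_PER_DEVICE = 50_000
    DISK_IOPS_PER_SPINDLE = 150      # roughly a 10K RPM drive

    def estimated_iops(flash_hit_rate, flash_devices, spindles):
        flash_limit = flash_devices * FLASH_IOPS_PER_DEVICE
        disk_limit = spindles * DISK_IOPS_PER_SPINDLE
        caps = []
        if flash_hit_rate > 0:
            caps.append(flash_limit / flash_hit_rate)        # flash path saturates
        if flash_hit_rate < 1:
            caps.append(disk_limit / (1 - flash_hit_rate))   # spindles saturate
        return min(caps)

    # Same 48 spindles and 4 flash devices, different working-set coverage
    print(f"{estimated_iops(0.90, 4, 48):,.0f} IOPS at a 90% flash hit rate")  # ~72,000
    print(f"{estimated_iops(0.60, 4, 48):,.0f} IOPS at a 60% flash hit rate")  # ~18,000

The point of the toy model is simply that once the flash hit rate drops, the spindle count dominates again, which is why this older approach only partially decouples performance from capacity.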

Two products that bring their own twists to the game:

Nimble Storage

[Figure: Nimble CASL architecture]

Nimble Storage uses flash to accelerate random reads, and accelerates writes by compressing and coalescing them into sequential 4.5MB stripes (compare this with IBM’s Storwize RtC, which compresses into 32K chunks, and you can see that what Nimble is doing is a little different).

Nimble performance is therefore primarily a function of:

  1. the amount of flash (read cache)
  2. the CPU available to do the compression/write coalescing

The number of spindles is not quite so important when you’re writing 4.5MB stripes. Nimble systems generally support at least 190 TB nett from 57 disks (assuming an average of 1.5x compression, or 254 TB if you expect 2x), and they claim that performance is pretty much decoupled from disk space, since you will generally hit the wall on flash and CPU before you hit the wall on sequential writes to disk. This kind of decoupling also allows you to get good performance and capacity in a very small amount of rack space. Nimble also offers CPU scaling in the form of a scale-out four-way cluster.

Nimble have come closer to decoupling performance and capacity than any other external storage vendor I have seen.
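
To see why the spindle count matters less under this approach, consider the toy sketch below: small random writes are compressed and packed into a large in-memory stripe, and only full stripes are written out sequentially. The 4.5MB stripe size matches the figure mentioned above, but the buffering and compression details are a simplified illustration rather than a description of CASL internals.

    import os
    import zlib

    STRIPE_SIZE = int(4.5 * 1024 * 1024)   # 4.5MB full stripes, as discussed above

    class WriteCoalescer:
        """Toy illustration: buffer compressed writes and emit only full sequential stripes."""

        def __init__(self):
            self.buffer = bytearray()
            self.stripes_written = 0

        def write(self, data):
            self.buffer += zlib.compress(data)      # compress each incoming small write
            while len(self.buffer) >= STRIPE_SIZE:  # a full stripe is ready
                self._flush_stripe(bytes(self.buffer[:STRIPE_SIZE]))
                del self.buffer[:STRIPE_SIZE]

        def _flush_stripe(self, stripe):
            # In a real array this would be one large sequential write to the
            # disk layer, which is cheap even on a modest spindle count.
            self.stripes_written += 1

    coalescer = WriteCoalescer()
    block = os.urandom(2048) * 2                    # a 4KB block that compresses roughly 2:1
    for _ in range(20_000):                         # simulate 20,000 random 4KB writes
        coalescer.write(block)
    print(f"{coalescer.stripes_written} full 4.5MB stripes written sequentially")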

PernixData FVP

PernixData Flash Virtualization Platform (FVP) is a software solution designed to build a flash read/write cache inside a VMware ESXi cluster, thereby accelerating I/Os without needing to add anything to your external disk system. PernixData argue that it is more cost-effective and efficient to add flash into the ESXi hosts than it is to add it to external storage systems. This has something in common with the current trend for converged scale-out server/storage solutions, but PernixData also works with existing external SAN environments.

There is criticism that flash technologies deployed in external storage are too far away from the app to be efficient. I recall Amit Dave (IBM Distinguished Engineer) recounting an analogy of I/O to eating, for which I have created my own version below:

  • Data in the CPU cache is like food in your spoon
  • Data in the server RAM is like food on your plate
  • Data in the shared Disk System cache is like food in the serving bowl in the kitchen
  • Data on the shared Disk System SSDs is like food you can get from your garden
  • Data on hard disks is like food in the supermarket down the road

PernixData works by keeping your data closer to the CPU, decoupling performance and capacity by focusing on a server-side caching layer that scales alongside your ESXi compute cluster. So this is analogous to getting food from your table rather than food from your garden. With PernixData you tend to scale performance as you add more compute nodes, rather than when you add more back-end capacity.
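
Below is a heavily simplified sketch of the general idea of a host-side read cache sitting in front of a slower shared array. It illustrates server-side caching in general rather than FVP itself, which also accelerates writes and protects cached data across hosts.

    from collections import OrderedDict

    class HostSideReadCache:
        """Toy LRU read cache, standing in for flash inside the ESXi host."""

        def __init__(self, backing_store, capacity_blocks):
            self.backing_store = backing_store      # the (slow) shared disk system
            self.capacity = capacity_blocks
            self.cache = OrderedDict()              # block id -> data, kept in LRU order
            self.hits = 0
            self.misses = 0

        def read(self, block_id):
            if block_id in self.cache:
                self.cache.move_to_end(block_id)    # refresh LRU position
                self.hits += 1
                return self.cache[block_id]
            self.misses += 1
            data = self.backing_store[block_id]     # fetch from the array across the SAN
            self.cache[block_id] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)      # evict the least recently used block
            return data

    # Hypothetical workload: a small hot working set read over and over
    array = {block_id: f"data-{block_id}" for block_id in range(1000)}
    cache = HostSideReadCache(array, capacity_blocks=100)
    for _ in range(50):
        for block_id in range(80):                  # the hot set fits comfortably in host flash
            cache.read(block_id)
    print(f"hits={cache.hits} misses={cache.misses}")   # nearly every read is served host-side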

To Decouple or not to Decouple?

Decoupling as a theoretical concept is surely a good thing – independent scaling in two dimensions – and it is especially nice if it can be done without introducing significant extra cost, complexity or management overhead.

It is however probably also fair to say that many other systems can approximate the effect, albeit with a little more complexity.

———————————————————————————————————-

Disclosures:

Jim Kelly holds PernixPrime accreditation from PernixData and is a certified Nimble Storage Sales Professional. ViFX is a reseller of both Nimble Storage and PernixData.

How well do you know your scale-out storage architectures?

The clustered/scale-out storage world keeps getting more and more interesting, and some would say more and more confusing.

There are too many to list them all here, but here are block diagrams depicting seven interesting storage or converged hardware architectures. See if you can decipher my diagrams and match the labels by choosing between the three sets of options in the multi-choice poll at the bottom of the page:

 

A VMware EVO: RACK
B IBM XIV
C VMware EVO: RAIL
D Nutanix
E Nimble
F IBM GPFS Storage Server (GSS)
G VMware Virtual SAN

 

[Figure: block diagrams of the seven scale-out architectures]

 


 

You can read more on VMware’s EVO:RAIL here.

Hypervisor / Storage Convergence

This is simply a re-blogging of an interesting discussion by James Knapp at http://www.vifx.co.nz/testing-the-hyper-convergence-waters/ looking at VMware Virtual SAN. Even more interesting than the blog post, however, is the whitepaper “How hypervisor convergence is reinventing storage for the pay-as-you-grow era”, which ViFX has produced as a contribution to the discussion around hypervisor storage.

I would recommend going to the first link for a quick read of what James has to say and then downloading the whitepaper from there for a more detailed view of the technology.

 

 

IBM Software-defined Storage

The phrase ‘Software-defined Storage’ (SDS) has quickly become one of the most widely used marketing buzz terms in storage. It seems to have originated with Nicira’s use of the term ‘Software-defined Networking’, which VMware adopted when they bought Nicira in 2012, where it evolved into the ‘Software-defined Data Center’, including ‘Software-defined Storage’. VMware’s VSAN technology therefore has the top-of-mind position when we are talking about SDS. I really wish they’d called it something other than VSAN, though, so as to avoid the clash with the ANSI T11 VSAN standard developed by Cisco.

I have seen IBM regularly use the term ‘Software-defined Storage’ to refer to:

  1. GPFS
  2. Storwize family (which would include FlashSystem V840)
  3. Virtual Storage Center / Tivoli Storage Productivity Center

I recently saw someone at IBM referring to FlashSystem 840 as SDS even though to my mind it is very much a hardware/firmware-defined ultra-low-latency system with a very thin layer of software so as to avoid adding latency.

Interestingly, IBM does not seem to market XIV as SDS, even though it is clearly a software solution running on commodity hardware that has been ‘applianced’ so as to maintain reliability and supportability.

Let’s take a quick look at the contenders:

1. GPFS: GPFS is a file system with a lot of storage features built in or added on, including de-clustered RAID, policy-based file tiering, snapshots, block replication, support for NAS protocols, WAN caching, continuous data protection, single-namespace clustering, HSM integration, TSM backup integration, and even a nice new GUI. GPFS is the current basis for IBM’s NAS products (SONAS and V7000U) as well as the GSS (GPFS Storage Server), which is currently targeted at HPC markets but I suspect is likely to re-emerge as a more broadly targeted product in 2015. I get the impression that GPFS may well be the basis of IBM’s SDS strategy going forward.
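
As a flavour of what policy-based file tiering means in practice, here is a toy sketch that moves files untouched for more than 30 days from a fast pool to a nearline pool. It is plain Python for illustration only; GPFS expresses this kind of rule in its own ILM policy language, and the pool paths and threshold here are invented.

    import os
    import shutil
    import time

    FAST_POOL = "/pools/ssd"            # hypothetical fast pool mount point
    NEARLINE_POOL = "/pools/nearline"   # hypothetical nearline pool mount point
    MAX_AGE_DAYS = 30                   # hypothetical policy threshold

    def migrate_cold_files(fast_pool=FAST_POOL, nearline_pool=NEARLINE_POOL,
                           max_age_days=MAX_AGE_DAYS):
        """Move files not accessed within max_age_days from the fast pool to nearline."""
        cutoff = time.time() - max_age_days * 86400
        for name in os.listdir(fast_pool):
            path = os.path.join(fast_pool, name)
            if os.path.isfile(path) and os.stat(path).st_atime < cutoff:
                shutil.move(path, os.path.join(nearline_pool, name))
                print(f"migrated {name} to {nearline_pool}")

    if __name__ == "__main__":
        migrate_cold_files()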

2. Storwize: The Storwize family is derived from IBM’s SAN Volume Controller technology and it has always been a software-defined product, but tightly integrated with hardware so as to control reliability and supportability. In the Storwize V7000U we see the coming together of Storwize and GPFS, and at some point IBM will need to make the call whether to stay with the DS8000-derived RAID that is in Storwize currently, or move to the GPFS-based de-clustered RAID. I’d be very surprised if GPFS hasn’t already won that long-term strategy argument.

3. Virtual Storage Center: The next contender in the great SDS shootout is IBM’s Virtual Storage Center (VSC) and its sub-component Tivoli Storage Productivity Center (TPC). Within some parts of IBM, VSC is talked about as the key to SDS. VSC is edition-dependent but usually includes the SAN Volume Controller / Storwize code developed by IBM Systems and Technology Group, as well as the TPC and FlashCopy Manager code developed by IBM Software Group, plus some additional TPC analytics and automation. VSC gives you a tremendous amount of functionality to manage a large, complex site, but it requires real commitment to secure that value. I think of VSC and XIV as the polar opposites of IBM’s storage product line, even though some will suggest you do both. XIV drives out complexity based on a kind of 80/20 rule, whereas VSC is designed to let you manage and automate a complex environment.

Commodity Hardware: Many proponents of SDS will claim that it’s not really SDS unless it runs on pretty much any commodity server. GPFS and VSC qualify by this definition, but Storwize does not, unless you count the fact that SVC nodes are x3650 or x3550 servers. However, we are already seeing the rise of certified VMware VSAN-ready nodes as a way to control reliability and supportability, so perhaps we are heading for a happy medium between the two extremes of a traditional HCL menu and a fully buttoned down appliance.

Product Strategy: While IBM has been pretty clear in defining its focus markets of Cloud, Analytics, Mobile, Social and Security (the ‘CAMSS’ message that is repeatedly referred to inside IBM), I think it has been somewhat less clear in articulating a consistent storage strategy, and I am finding that as the storage market matures, smart people increasingly want to know what the vendors’ strategies are. I say vendors plural because I see the same lack of strategic clarity when I look at EMC and HP, for example. That’s not to say the products aren’t good, or the roadmaps are wrong, just that the long-term strategy is either not well defined or not clearly articulated.

It’s easier for new players and niche players of course, and VMware’s Software-defined Storage strategy, for example, is both well-defined and clearly articulated, which will inevitably make it a baseline for comparison with the strategies of the traditional storage vendors.

A/NZ STG Symposium: For the A/NZ audience, if you want to understand IBM’s SDS product strategy, the 2014 STG Tech Symposium in August is the perfect opportunity. Speakers include Sven Oehme from IBM Research, who is deeply involved with GPFS development; Barry Whyte from IBM STG in Hursley, who is deeply involved in Storwize development; and Dietmar Noll from IBM in Frankfurt, who is deeply involved in the development of Virtual Storage Center.

Melbourne – August 19-22

Auckland – August 26-28
