<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cloudscaling &#187; hadoop</title>
	<atom:link href="http://www.cloudscaling.com/blog/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.cloudscaling.com</link>
	<description>Open Cloud Solutions</description>
	<lastBuildDate>Wed, 09 May 2012 16:43:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Grid, Cloud, HPC &#8230; What&#8217;s the Diff?</title>
		<link>http://www.cloudscaling.com/blog/cloud-computing/grid-cloud-hpc-whats-the-diff/</link>
		<comments>http://www.cloudscaling.com/blog/cloud-computing/grid-cloud-hpc-whats-the-diff/#comments</comments>
		<pubDate>Thu, 18 Nov 2010 19:22:42 +0000</pubDate>
		<dc:creator>Randy Bias</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[cloudscaling]]></category>
		<category><![CDATA[commoditization]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[grid]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hpc]]></category>
		<category><![CDATA[hsc]]></category>
		<category><![CDATA[iaas]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[scalability]]></category>
		<category><![CDATA[scaling]]></category>

		<guid isPermaLink="false">http://cloudscaling.com/blog/?p=1517</guid>
		<description><![CDATA[It&#8217;s always nice when another piece of the puzzle comes into focus.  In this case, my time speaking at the first ever International Super Computer (ISC) Cloud Conference the week before last was well spent.  The conference was heavily attended &#8230; <a href="http://www.cloudscaling.com/blog/cloud-computing/grid-cloud-hpc-whats-the-diff/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s always nice when another piece of the puzzle comes into focus.  In this case, my time speaking at the first ever International Super Computer (ISC) Cloud Conference the week before last was well spent.  The conference was heavily attended by those out of the <a href="http://en.wikipedia.org/wiki/Grid_computing">grid computing</a> space and I learned a lot about both cloud and grid.  In particular, I think I finally understand what causes some to view grid as a precursor to cloud while others view it as a different beast only tangentially related.</p>
<p>This really comes down to a particular TLA in use to describe grid: <a href="http://en.wikipedia.org/wiki/High-performance_computing">High Performance Computing</a> or HPC.  HPC and grid are commonly used interchangeably.  Cloud is not HPC, although now it can certainly support some HPC workloads (see <a href="http://aws.amazon.com/hpc-applications/">Amazon&#8217;s EC2 HPC offering</a>).  No, cloud is something a little bit different:  High Scalability Computing or simply HSC here.</p>
<p>Let me explain in some depth &#8230;</p>
<p><span id="more-1517"></span></p>
<p><strong>Scalability vs. Performance</strong><br />
First, it&#8217;s critical for readers to understand the fundamental difference between <a href="http://en.wikipedia.org/wiki/Scalability">scalability</a> and <a href="http://en.wikipedia.org/wiki/Computer_performance">performance</a>.  While the two are frequently conflated, they are quite different.  Performance is the capability of a particular component to provide a certain amount of capacity, throughput, or &#8216;yield&#8217;.  Scalability, in contrast, is about the ability of a system to expand to meet demand.  This is quite frequently measured by looking at the aggregate performance of the individual components of a particular system and how they function over time.</p>
<p>Put more simply, performance measures the capability of a single part of a large system while scalability measures the ability of a large system to grow to meet growing demand.<br />
Scalable systems may have individual parts that are relatively low performing.  I have heard that the Amazon.com retail website&#8217;s web servers went from 300 transactions per second (TPS) to a mere 3 TPS each after moving to a more scalable architecture.  The upside is that while every web server might have lower individual performance, the overall system became significantly more scalable and new web servers could be added ad infinitum.</p>
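<p>The trade-off in that anecdote can be put into numbers.  The sketch below is only an illustration: the per-server figures come from the story above, while the site-wide target of 30,000 TPS and the helper name <code>servers_needed</code> are assumptions made up for the example.</p>

```python
import math


def servers_needed(target_tps: float, per_server_tps: float) -> int:
    """Number of identical servers required to reach an aggregate throughput."""
    return math.ceil(target_tps / per_server_tps)


# Per-server figures from the anecdote: 300 TPS before, 3 TPS after.
# The 30,000 TPS site-wide target is a made-up number for illustration.
target = 30_000
print(servers_needed(target, 300))  # 100
print(servers_needed(target, 3))    # 10000
```

<p>Each server does far less work in the second case, but the system as a whole keeps growing simply by adding more of them.</p>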
<p>High performing systems on the other hand focus on eking out every ounce of resource from a particular component, rather than focusing on the big picture.  One might have high performance systems in a very scalable system or not.</p>
<p>For most purposes, scalability and performance are orthogonal, but many either equate them or believe that one breeds the other.</p>
<p><strong>Grid &amp; High Performance Computing</strong><br />
The origins of HPC/Grid exist within the academic community where needs arose to crunch large data sets very early on.  Think satellite data, genomics, nuclear physics, etc.  Grid, effectively, has been around since the beginning of the enterprise computing era, when it became easier for academic research institutions to move away from large mainframe-style supercomputers (e.g. Cray, Sequent) towards a more scale-out model using lots of relatively inexpensive x86 hardware in large clusters.  The emphasis here is on *relatively*.</p>
<p>Most x86 clusters today are built out for <a href="http://www.top500.org/">very high performance *and* scalability</a>, but with a particular focus on performance of individual components (servers) and the interconnect network for reasons that I will explain below.  The price/performance of the overall system is not as important as aggregate throughput of the entire system.  Most academic institutions build out a grid to the full budget they have, attempting to eke out every ounce of performance in each component.</p>
<p>This is not the way that cloud pioneers such as Amazon.com and Google built their infrastructures.</p>
<p><strong>Cloud &amp; High Scalability Computing</strong><br />
Cloud, or HSC, by contrast, focuses on hitting the price/performance sweet spot, using truly commodity components and buying *lots* more of them.  This means building very large and scalable systems.</p>
<p>I was surprised at the ISC Cloud Conference when I heard one participant bragging about their cluster with 320,000 &#8216;cores&#8217;.  Amazon EC2 (sans the new HPC offering) is at roughly 500,000 cores, quite possibly more.  And Google is probably on the order of 10 million+ cores.  Clouds built around High Scalability Computing are an order of magnitude larger than most grid clusters and designed to handle generic workloads, which requires hitting the price/performance sweet spot when building them.</p>
<p>Grid workloads can be very, very different.</p>
<p><strong>Some Grid Workloads Drive the Grid Community</strong><br />
In talking to the grid community I learned that there are effectively two key types of problem that are solved on large scale computing clusters: MPI (<a href="http://en.wikipedia.org/wiki/Message_Passing_Interface">Message Passing Interface</a>) and &#8216;embarrassingly parallel&#8217; problems.  I&#8217;m using terms I heard at the conference, but will use MPI and EPP (embarrassingly parallel problem) so that I can shorthand throughout the rest of this article.</p>
<p>MPI is essentially a programming paradigm that allows for taking extremely large sets of data and crunching the information in parallel WHILE sharing the data between compute nodes. Sometimes this is also referred to as &#8216;clustering&#8217;, although that term is frequently overloaded today.  Certain kinds of problems necessitate this sharing as the computed results on one node may affect the computed results on another node in the grid.  MPI-based grids, the de facto standard for most academic institutions, are built for maximum throughput and performance per system, including the lowest latency possible.  Most of them use Infiniband technology, for example, to effectively turn the entire grid into a single &#8216;<a href="http://en.wikipedia.org/wiki/Supercomputer">supercomputer</a>&#8217;.  In fact, most of these MPI-based grids are ranked in the Supercomputer <a href="http://en.wikipedia.org/wiki/TOP500">Top500</a>.</p>
<p>An MPI grid/cluster, in many ways, looks more like an old school mainframe and technology such as Infiniband essentially turns the network into a high-speed bus, just like a PCI bus inside a typical x86 server.</p>
<p>EPP workloads, by contrast, have no data sharing requirements.  A very large dataset is chopped into pieces, distributed to a large pool of workers, and then the data is brought back and reassembled.  Does this sound familiar?  It should, it&#8217;s very similar to Google&#8217;s <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> functionality and the open source tool, Hadoop.  EPP workloads are very commonly run on top of MPI clusters, although some academic institutions build out separate or smaller grids to run them instead.</p>
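<p>The chop, distribute, and reassemble pattern described above can be sketched in a few lines.  This is a minimal illustration, not how Hadoop or MapReduce is implemented: the function names and the sum-of-squares workload are invented for the example, and a local thread pool stands in for what would be a pool of separate machines.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(data, size):
    """Chop a dataset into independent pieces of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(piece):
    """Process one piece in isolation; nothing is shared between workers."""
    return sum(x * x for x in piece)


def run_epp(data, size=1000, workers=4):
    """Distribute the pieces to a pool of workers, then reassemble the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(work, chunks(data, size)))
    return sum(partials)


total = run_epp(list(range(10_000)))
print(total == sum(x * x for x in range(10_000)))  # True
```

<p>Because no worker ever needs another worker&#8217;s intermediate result, nothing here depends on a low-latency interconnect, which is the key difference from an MPI job.</p>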
<p>The majority of grid workloads are of the EPP type.  The diagram below shows this.</p>
<p><a href="http://cloudscaling.com/wp-content/uploads/2010/11/hpc-vs-hsc-pyramid.png"><img class="aligncenter size-full wp-image-1518" title="hpc-vs-hsc-pyramid" src="http://cloudscaling.com/wp-content/uploads/2010/11/hpc-vs-hsc-pyramid.png" alt="" width="313" height="315" /></a></p>
<p>I had one person confide in me that &#8220;<em>MPI power users drive grid requirements for vendors and assume that if their problems are solved, then the problems of [EPP] users are solved.</em>&#8221;<br />
This is interesting since these two types of workloads have different needs.</p>
<p><strong>HPC vs. HSC</strong><br />
The reality is that High Scalability Computing is ideal for the majority of EPP grid workloads.  In fact, large amounts of this kind of work, in the form of MapReduce jobs have been running on Amazon EC2 since its beginning and have driven much of its growth.</p>
<p>HPC is a different beast altogether as many of the MPI workloads require very low latency and servers with individually high performance.  It turns out, however, that not all MPI workloads are the same.  The lower portion of the top part of that pyramid is filled with MPI workloads that require a great network, but not an Infiniband network:</p>
<p><a href="http://cloudscaling.com/wp-content/uploads/2010/11/hpc-vs-hsc-pyramid-mpi-high-latency.png"><img class="aligncenter size-full wp-image-1519" title="hpc-vs-hsc-pyramid-mpi-high-latency" src="http://cloudscaling.com/wp-content/uploads/2010/11/hpc-vs-hsc-pyramid-mpi-high-latency.png" alt="" width="500" height="284" /></a></p>
<p>In keeping with Amazon Web Services&#8217; tendency to build out using commodity (cloud) techniques, their new HPC offering does not use Infiniband, but instead opts for 10Gig Ethernet.  This makes the network great, but not awesome, and allows them to create a cloud service tailored for many HPC jobs.  In fact, this <a href="http://blog.cyclecomputing.com/2010/11/a-couple-more-nails-in-the-coffin-of-the-private-compute-cluster-gpu-on-cloud.html">recent benchmark posting</a> by CycleComputing shows that AWS&#8217; Cloud HPC system has impressive performance, particularly for many MPI workloads.</p>
<p>HSC designed to accommodate HPC!</p>
<p>Which brings us back.</p>
<p><strong>The Moral of the Story</strong><br />
So, what we have learned is that scalable computing is different from computing optimized for performance.  We have also learned that cloud can accommodate grid *and* HPC workloads, but is not itself necessarily a grid in the traditional sense.  More importantly, an extremely overlooked segment of grid (EPP) has pressing needs that can be accommodated by run-of-the-mill clouds such as EC2.  In addition to supporting EPP workloads that run on the &#8216;regular&#8217; cloud, some clouds may also build out an area designed specifically for &#8216;HPC&#8217; workloads.</p>
<p>In other words, grid is not cloud, but there are some relationships and there is obviously a huge opportunity for cloud providers to accommodate this market segment.  At least, Amazon is spending tens of millions of dollars to do so, so why not you?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudscaling.com/blog/cloud-computing/grid-cloud-hpc-whats-the-diff/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Up, Out, Centralized, and Decentralized</title>
		<link>http://www.cloudscaling.com/blog/cloud-computing/up-out-centralized-and-decentralized/</link>
		<comments>http://www.cloudscaling.com/blog/cloud-computing/up-out-centralized-and-decentralized/#comments</comments>
		<pubDate>Tue, 28 Jul 2009 15:30:51 +0000</pubDate>
		<dc:creator>Randy Bias</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[centralization]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[commoditization]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[decentralization]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[Internet Operations]]></category>
		<category><![CDATA[open]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[sharding]]></category>

		<guid isPermaLink="false">http://cloudscaling.com/blog/?p=433</guid>
		<description><![CDATA[It can be confusing to understand how to scale computing systems, but it&#8217;s not rocket science.  There are really only two main axes of scale: out and up.  Closely related to the axis of scale is the general type of &#8230; <a href="http://www.cloudscaling.com/blog/cloud-computing/up-out-centralized-and-decentralized/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[
<p><img class="alignright size-medium wp-image-439" title="vert-scaling-diagram1" align="right" src="http://cloudscaling.com/wp-content/uploads/2009/07/vert-scaling-diagram1-269x299.png" alt="vert-scaling-diagram1" width="269" height="299" />It can be confusing to understand how to scale computing systems, but it&#8217;s not rocket science.  There are really only two main axes of scale: out and up.  Closely related to the axis of scale is the general type of architecture: centralized or decentralized.  In this article I&#8217;m going to briefly revisit scaling and then talk about centralized vs. decentralized architectures.</p>
<p class="MsoNormal"><strong>The Axes of Scale</strong></p>
<p class="MsoNormal">While ‘scaling’ is a popular topic, the scalability of a system is largely misunderstood.  You can scale ‘up’ (vertical) or ‘out’ (horizontal).  Usually when people talk about scaling in the context of cloud computing, they mean a scale-out solution.  This is because scale-up is not possible without control of the hardware, which you don&#8217;t usually have in a cloud computing scenario.  Still, scale-up is a valid tactic for many situations such as scaling databases and fileservers.</p>
<p class="MsoNormal">Scale-out means using more units of a given resource (below), while scale-up means using fewer units while increasing the size of each unit (up and right).</p>
<p class="MsoNormal" style="text-align: center;"><img class="aligncenter size-large wp-image-434" title="horizontal-scaling-diagram" src="http://cloudscaling.com/wp-content/uploads/2009/07/horizontal-scaling-diagram-1024x227.png" alt="horizontal-scaling-diagram" width="614" height="136" /></p>
<p class="MsoNormal">I believe scaling a system from this perspective is relatively well understood, but not necessarily the main ways in which scaling is achieved.</p>
<p class="MsoNormal"><strong>Centralized and Decentralized Systems</strong></p>
<p class="MsoNormal">At the heart of most architectural or design decisions is whether to build a centralized or decentralized system.  Centralized systems can generally be purchased off the shelf, come in high-availability (HA) pairs, are technically simple to operate, provide vertical scaling capabilities, and are delivered at a premium price per unit.  Well-designed centralized systems provide very high uptimes and availability. They can also be prone to catastrophic failures due to misconfigurations.  This happens because they usually have a synchronized configuration and a misconfiguration in one is propagated immediately to the other.  Given that misconfiguration is the number one source of failures or security breaches, this can be a major concern.  Examples of centralized systems include: redundant load balancer pairs, NAS / SAN systems delivered as HA units, and centralized network switches.</p>
<p class="MsoNormal">Decentralized systems, by contrast, are scaled horizontally, can be technically complex to operate, are usually written in-house or built from open source, and are priced such that individual units are relatively cheap.  Decentralized systems tend to be highly resilient in the face of the failure of a single unit because configurations are not shared. A unit’s failure has no impact on neighboring units. At most, the overall capacity is affected. Examples include: shared-nothing web/app tiers, pools of virtualized servers, top of rack switch deployments, and peer-to-peer (P2P) systems.</p>
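<p class="MsoNormal">The difference in failure behavior can be shown with simple arithmetic.  The sketch below is illustrative only: the pool sizes, per-unit capacities, and the function name <code>remaining_capacity</code> are assumptions, and the HA pair case models the worst-case scenario in which a bad configuration is synchronized to both units.</p>

```python
def remaining_capacity(units: int, failed: int, per_unit: float) -> float:
    """Aggregate capacity left when `failed` units out of `units` are down."""
    healthy = max(units - failed, 0)
    return healthy * per_unit


# Hypothetical decentralized pool: 20 shared-nothing web servers,
# 50 requests/second each, one unit fails on its own.
print(remaining_capacity(20, 1, 50.0))   # 950.0

# Hypothetical centralized HA pair: a misconfiguration is propagated to
# both units, so both are effectively down at once.
print(remaining_capacity(2, 2, 500.0))   # 0.0
```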
<p class="MsoNormal">Most web systems do not require an either/or decision as they use a combination of centralized and decentralized components. A good example is a typical 2-tier web application stack like Ruby-on-Rails.  The web/app tier will be scaled-out and the databases will be scaled-up.  Despite the hype around cloud computing, this is <strong>still</strong> the norm.  Even on Amazon&#8217;s EC2, your typical web app starts with the smallest instance size possible and then when scaling limits are hit, is upgraded to the next biggest instance size.</p>
<p class="MsoNormal">The decentralized approach has been out of favor historically due to technical complexity and low margins.  Also, larger enterprises tend to lean in the direction of simpler to manage, centralized, scale-up solutions.  Vendors prefer to sell fewer, larger, high margin solutions while enterprises like technically simple solutions.  This is beginning to change as we can see with folks like <a href="http://www.cloudera.com">Cloudera</a> providing commercial support for decentralized data processing systems like <a href="http://en.wikipedia.org/wiki/Hadoop">Hadoop</a>.</p>
<p class="MsoNormal">Centralized approaches have the distinct disadvantage of scale-up limits: you can only build a single server so large.  If your growth needs fit neatly inside <a href="http://en.wikipedia.org/wiki/Moores_Law">Moore&#8217;s Law</a>, you&#8217;re in luck; if not, you&#8217;ll have to scale out instead of up.  At some point only a scale-out approach can continue to grow capacity.  This is why companies like Google and Amazon have chosen their particular ‘web-scale’ approaches.</p>
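<p class="MsoNormal">To make the Moore&#8217;s Law comparison concrete, here is a rough sketch.  It rests on loud assumptions: the capacity of the biggest single server is taken to double every two years, demand grows at a fixed annual rate, and today&#8217;s demand uses a given fraction of today&#8217;s biggest server.  The function name and all figures are hypothetical.</p>

```python
def years_until_scale_up_limit(demand_growth: float,
                               utilization: float,
                               doubling_years: float = 2.0,
                               horizon: int = 50):
    """First whole year in which demand exceeds the biggest single server.

    demand_growth:  annual growth rate of demand (1.0 means doubling yearly).
    utilization:    fraction of today's biggest server that demand uses now.
    doubling_years: assumed period over which single-server capacity doubles.
    Returns None if demand never overtakes the server within `horizon` years.
    """
    for year in range(1, horizon + 1):
        demand = utilization * (1.0 + demand_growth) ** year
        biggest_server = 2.0 ** (year / doubling_years)
        if demand > biggest_server:
            return year
    return None


# Demand doubling every year, currently using 10% of the biggest server.
print(years_until_scale_up_limit(1.0, 0.1))  # 7
# Demand growing 40% a year stays inside the assumed hardware curve.
print(years_until_scale_up_limit(0.4, 0.1))  # None
```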
<div>
<div id="ftn">
<p class="MsoFootnoteText"><strong>Conclusion</strong></p>
<p class="MsoFootnoteText">Scaling up via centralized systems is still a viable architectural decision for those whose growth needs fit Moore&#8217;s Law.  Given the advent of cloud computing and the ability to add more servers when needed, scale-out tactics for building decentralized systems have been gaining prevalence.  We will begin to see more and more scale-out solutions even within the enterprise as startups like Cloudera, <a href="http://www.parascale.com">ParaScale</a>, <a href="http://www.stackjet.com/">StackJet</a>, and many others build easier to manage decentralized systems.  I am very much looking forward to this new world as it solves a great many hard problems in a very efficient manner.  Just remember that scaling up will always be a viable and, in some cases, cost effective architectural decision.</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudscaling.com/blog/cloud-computing/up-out-centralized-and-decentralized/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The Secret Sauce Problem</title>
		<link>http://www.cloudscaling.com/blog/uncategorized/the-secret-sauce-problem/</link>
		<comments>http://www.cloudscaling.com/blog/uncategorized/the-secret-sauce-problem/#comments</comments>
		<pubDate>Sun, 19 Jul 2009 03:25:38 +0000</pubDate>
		<dc:creator>Randy Bias</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[batch processing]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[Cloud Applications]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[N-tier]]></category>
		<category><![CDATA[secret sauce]]></category>

		<guid isPermaLink="false">http://cloudscaling.com/blog/?p=406</guid>
		<description><![CDATA[The vast majority of web applications have what I call The Secret Sauce Problem.  Every commercial web service of any kind needs to be differentiated in order to be interesting and attractive to customers.  There isn&#8217;t any kind of differentiation &#8230; <a href="http://www.cloudscaling.com/blog/uncategorized/the-secret-sauce-problem/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The vast majority of web applications have what I call <em>The Secret Sauce Problem</em>.  Every commercial web service of any kind needs to be differentiated in order to be interesting and attractive to customers.  There isn&#8217;t any kind of differentiation in a typical 3-tier or N-tier[1] web application stack. This leads to needing secret sauce of some kind.  Secret sauce varies widely from application to application, but even between applications of the same kind there is a need to be differentiated. Two photo-sharing sites would typically build their backend photo processing and storage systems in different ways depending on the market they were trying to serve.</p>
<p><strong>The Secret in the Sauce</strong><br />
Secret sauce comes in many flavors and sizes.  Some sites need software, some need hardware, and others need special architectures.  More and more frequently cloud computing systems are used for those types of secret sauce that require some kind of batch processing.  Examples of batch processing applications include photo resizing and transcoding.  Cloud vendors like <a href="http://www.rightscale.com">RightScale</a> address the problem by providing a primary product that manages N-tier web applications and another one (<a href="http://support.rightscale.com/12-Guides/RightGrid_User_Guide/01-RightGrid_Overview">RightGrid</a>) that manages batch processing.  Many also try to roll-their-own using technologies like <a href="http://cloudscaling.com/blog/technology/big-data/hadoop-101-by-chris-wensel">Hadoop</a>.</p>
<p>This diagram illustrates how these particular solutions work:</p>
<p style="text-align: center; "><img class="aligncenter size-large wp-image-407" src="http://cloudscaling.com/wp-content/uploads/2009/07/secret-sauce-diagram-1024x549.jpg" alt="secret-sauce-diagram" width="502" height="269" /></p>
<p><strong>Secret Sauces of the World</strong><br />
Taking a look at some real-world examples will bring it home.  I&#8217;ve got a few below that I think will help us understand better.</p>
<ul>
<li><a href="http://www.scribd.com">Scribd</a>: Windows clusters for processing and transforming documents</li>
<li><a href="http://www.smugmug.com">SmugMug</a>: Photo <a href="http://blogs.smugmug.com/don/2008/06/03/skynet-lives-aka-ec2-smugmug/">resizing</a> and <a href="http://blogs.smugmug.com/don/2006/11/10/amazon-s3-show-me-the-money/">storage</a></li>
<li><a href="http://www.facebook.com">Facebook</a>: <a href="http://www.facebook.com/note.php?note_id=39391378919">Highly scalable memcached</a> clusters, <a href="http://www.facebook.com/note.php?note_id=76191543919">Haystack</a> super scalable image storage system</li>
<li><a href="http://www.runa.com">Runa</a>: Real-time consumer purchasing analytics</li>
</ul>
<p>In each of these cases, the web service needs to differentiate.  I&#8217;m particularly fond of  Scribd&#8217;s secret sauce. The Scribd folks are largely a bunch of Linux &amp; Ruby-on-Rails geeks who realized early on that the majority of document processing tools were on the Microsoft Windows platform.  So, they built their own Windows document processing clusters for secret sauce.  That&#8217;s pragmatic <strong>and</strong> clever.</p>
<p><strong>Conclusion</strong><br />
Every web application needs something special to make it compelling to its chosen customer base.  This defines <em>The Secret Sauce Problem</em> that confronts every web service that grows to any significant size.  Although secret sauce can come in many forms, it is very commonly a batch processing application of some kind.  The interesting part of any web application or service is not the web app itself, but the secret sauce.</p>
<p>If you don&#8217;t have the sauce, you&#8217;ll need it.  Get some now.</p>
<hr />[1] N-tier is essentially the same as 3-tier, but represents web apps where there might be more than one app-tier</p>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudscaling.com/blog/uncategorized/the-secret-sauce-problem/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hadoop 101 by Chris Wensel</title>
		<link>http://www.cloudscaling.com/blog/uncategorized/hadoop-101-by-chris-wensel/</link>
		<comments>http://www.cloudscaling.com/blog/uncategorized/hadoop-101-by-chris-wensel/#comments</comments>
		<pubDate>Mon, 22 Jun 2009 16:15:21 +0000</pubDate>
		<dc:creator>Randy Bias</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[sharding]]></category>

		<guid isPermaLink="false">http://cloudscaling.com/blog/?p=350</guid>
		<description><![CDATA[What conversation about cloud computing is complete without a mention of big data, distributing processing, and distributed databases?  There is a recent trend away from relying exclusively on the traditional relational database for everything.  Newer technologies like BigTable and Hadoop &#8230; <a href="http://www.cloudscaling.com/blog/uncategorized/hadoop-101-by-chris-wensel/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What conversation about cloud computing is complete without a mention of big data, distributed processing, and distributed databases?  There is a recent trend away from relying exclusively on the traditional relational database for everything.  Newer technologies like <a href="http://en.wikipedia.org/wiki/BigTable">BigTable</a> and <a href="http://en.wikipedia.org/wiki/Hadoop">Hadoop</a> provide an alternative mechanism for storing and processing large sets of data that don&#8217;t necessarily have extensive relationships needing modeling.  These technologies allow for a much more scalable solution.</p>
<p>In fact, they help in two ways: one by allowing an application to process more data using horizontal scalability (aka &#8216;elasticity&#8217;) and two by reducing load on the primary relational database and hence allowing you to go longer before &#8216;<a href="http://en.wikipedia.org/wiki/Shard_(database_architecture)">sharding</a>&#8217;.</p>
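<p>For readers unfamiliar with the term, sharding means splitting one logical database across several servers and routing each record by a key.  The sketch below shows one common routing scheme, hash-based sharding; it is a generic illustration rather than anything specific to Hadoop or BigTable, and the function name, key format, and shard count are made up for the example.</p>

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the latter is
    randomized per process for strings and would route inconsistently.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical keys routed across four database shards.
for user in ("user:1", "user:2", "user:3"):
    print(user, shard_for(user, 4))
```

<p>Every extra shard adds operational work, which is why offloading bulk processing to something like Hadoop in order to postpone this step can be attractive.</p>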
<p><a href="http://chris.wensel.net/">Chris Wensel</a> is the man when it comes to understanding Hadoop and he recently gave a couple of talks introducing Hadoop.  Here is one of them:</p>
<div id="__ss_1616859" style="width: 425px; text-align: left;"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" title="Building Scale Free Applications with Hadoop and Cascading" href="http://www.slideshare.net/cwensel/building-scale-free-applications-with-hadoop-and-cascading-1616859?type=presentation">Building Scale Free Applications with Hadoop and Cascading</a><object width="425" height="355" data="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=buildingscalefreeapps-090621175200-phpapp02&amp;stripped_title=building-scale-free-applications-with-hadoop-and-cascading-1616859" type="application/x-shockwave-flash"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=buildingscalefreeapps-090621175200-phpapp02&amp;stripped_title=building-scale-free-applications-with-hadoop-and-cascading-1616859" /><param name="allowfullscreen" value="true" /></object></p>
<div style="font-size: 11px; font-family: tahoma,arial; height: 26px; padding-top: 2px;">View more <a style="text-decoration:underline;" href="http://www.slideshare.net/">Microsoft Word documents</a> from <a style="text-decoration:underline;" href="http://www.slideshare.net/cwensel">cwensel</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.cloudscaling.com/blog/uncategorized/hadoop-101-by-chris-wensel/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

