Breaking the Cache Bank

At the recent CoolTech Club meeting we enjoyed an excellent presentation by William Norton, Equinix co-founder and Chief Technical Liaison. Bill presented his calculations of the cost of distributing digitized media over various Internet architectures (presentation, full paper).

Apparently, the convenient Content Delivery Network architecture (CDN: Akamai and others) carries a significant price tag: about $0.24 per full-format movie. Some cost improvement can be achieved by deploying content servers at a few major peering points managed by Equinix (surprise! ;-)), which lowers the cost to about $0.17 per movie. This is certainly an improvement over the reported $1 that Netflix spends on snail-mail delivery, but it is still too high for mass adoption. (Also, it is questionable whether the existing Internet infrastructure can survive a significant volume of video content.)

An alternative P2P (peer-to-peer) delivery architecture costs the content owner/distributor significantly less, about $0.0018 per movie, roughly two orders of magnitude below the best conventional architecture! However, the catch is that the delivery expenses have not disappeared; they have just been passed onto the weak shoulders of the ISPs and the Internet backbone. It is no surprise that some ISPs feel pressured to deploy P2P “profiling” equipment, a futile (and probably illegal) attempt to suppress P2P content distribution across their networks. It is clear, however, that any long-term solution has to take into consideration not only the profitability of the distribution business, but the profitability of the consumer ISP businesses as well.

In our traditional after-party brainstorm, we “designed” a few architectures to achieve this goal, from “topology-aware P2P” to a massive set of “localized” peering points. The massive peering idea, as described by Bill Norton, calls for a large number (about 80,000) of small peering points that interconnect all local ISPs. Those peering points would serve as excellent locations for content providers/distributors to deploy their “cache”/replication servers, “pushing” video content to a location as close as possible to the end user. Heavy video content would then travel only a short distance, over cheap or free interconnects, relieving the Internet backbone of the pressure of high-volume multimedia traffic. This model has a lot of advantages for ISPs and content distributors: cheap or free inter-domain traffic results in a fixed-cost service. It is also advantageous from the end-user perspective: low delay, high quality of service over a reliable last-mile connection, the potential ability to watch high-def content in streamable (no-waiting) mode, and so on.

Unfortunately, this architecture will not fly. Why?

Humans are not perfect. Caching works well when the majority of users ask for the same content. However, outside of the North Korean “one party, one leader, one movie” domain, people probably have quite diverse interests. If a large portion of the traffic is generated by dissimilar content, then a caching architecture may become too expensive to implement. Let’s investigate whether this is the case for a large catalog of video content.

BitTorrent offers us an excellent model for investigation. Let’s consider the [in]famous “The Pirate Bay” (TPB), probably the largest repository of pirated video content. You can find virtually every US movie in this repository, as well as popular international ones, often translated into 2-3 major languages. As pirated content is available for “free”, TPB usage trends can serve as a good model for a flat-priced or ad-supported legal subscription service with a large movie catalog (comparable to the 90,000 titles of Netflix). (Note: many popular US TV series are not part of The Pirate Bay catalog; that will be a topic for a separate investigation.)

I took a “scrape” snapshot from one of the TPB “trackers”, http://tracker.prq.to/ (TPB uses about 10 different trackers, but PRQ seems to be the most popular). At the moment of the measurement, the PRQ tracker coordinated the distribution of over 400,000 torrents (unique content files) with a cumulative total of almost 3.7 million peers. Certainly, some peers may participate in several torrents, but usually in no more than 2-3 simultaneously. So we can safely assume that the collected data represents about a million unique users. Quite a good sample population for statistical analysis!

I ranked the torrents by number of participating peers and, assuming that bandwidth is proportional to the number of peers, calculated the cumulative distribution of bandwidth as a function of content popularity.

This graph describes the relationship between movie/torrent popularity and the traffic spent on its delivery. The horizontal axis measures the movie/torrent “rank”: its position in the list of torrents sorted by decreasing popularity. The vertical axis represents the portion of total bandwidth used by the torrents ranked from 1 to the given one. For example, according to the graph, the 50,000 most popular torrents/movies consume about 70% of total traffic.

The same graph with a logarithmic X scale gives us a closer look at the “most popular content”:

A few immediate observations:

  • The classical 80/20 rule still roughly holds: the 100,000 most popular torrents (25% of the content library) use 80% of the traffic.
  • The 10,000 most popular torrents (2.5% of the content library) use just under 45% of the traffic.
  • The 1,000 most popular torrents (0.25%) use almost 20% of the traffic.
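
The ranking step behind these numbers is simple enough to sketch. The snippet below is a minimal illustration in Python, not the script used to produce the graphs: it assumes the tracker scrape has already been reduced to a plain list of per-torrent peer counts, and the function names (`cumulative_share`, `torrents_needed`) and the toy data are invented for the example. As in the text, bandwidth is assumed to be proportional to the number of peers.

```python
from itertools import accumulate


def cumulative_share(peer_counts):
    """Fraction of all peers covered by the top-k torrents, for k = 1..n.

    Torrents are ranked by decreasing peer count, so element k-1 of the
    result is the share of traffic attributed to the k most popular ones.
    """
    ranked = sorted(peer_counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return [0.0] * len(ranked)
    return [running / total for running in accumulate(ranked)]


def torrents_needed(peer_counts, target_share):
    """Smallest number of top-ranked torrents reaching target_share of peers."""
    for rank, share in enumerate(cumulative_share(peer_counts), start=1):
        if share >= target_share:
            return rank
    return len(peer_counts)


# Toy data (hypothetical peer counts, not taken from the PRQ scrape).
sample = [5, 50, 10, 20, 5, 10]
print(cumulative_share(sample))      # cumulative curve, most popular first
print(torrents_needed(sample, 0.8))  # how many torrents cover 80% of peers
```

Plotting the output of `cumulative_share` against rank, on a linear or a logarithmic X axis, gives curves of the kind described above.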

And now the bad news for the CDN/caching architecture: let’s calculate the storage size required for a significant (80%) drop in traffic. According to the graph above, we need enough storage to cache the 100,000 (25%) most popular movies.

A decent DVD-quality DivX-compressed movie takes about 1.4 GB, so 100k movies require at least 140 terabytes of high-speed storage. In NetApp terms that will cost about $500k or more apiece. And we’ll need such storage in each of the 80,000 peering point locations, or about $40 billion in total. Even for the much smaller Netflix catalog, we’ll get about $10 billion for storage alone. Not rocket science, but still quite a significant investment.

Limited caching for the top 0.25% of content seems more reasonable: about $100M for a Netflix-size catalog. But it saves less than 20% of the traffic, quite a limited result to justify such a significant infrastructure investment.
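
The storage arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the post's own figures (1.4 GB per movie, roughly $500k per 140 TB of storage, 80,000 peering points, a 90,000-title Netflix catalog). Two things are illustrative assumptions rather than statements from the text: that storage cost scales linearly with capacity, and that the Netflix figures refer to the top 25% and top 0.25% of the catalog. The function name `cache_cost` is likewise made up for the example.

```python
GB_PER_MOVIE = 1.4            # DVD-quality DivX movie, per the post
USD_PER_TB = 500_000 / 140    # ~$500k per 140 TB; linear scaling is an assumption
LOCATIONS = 80_000            # local peering points


def cache_cost(titles):
    """Return (TB per site, USD per site, USD across all sites) for `titles` movies."""
    tb_per_site = titles * GB_PER_MOVIE / 1000   # decimal units: 1 TB = 1000 GB
    usd_per_site = tb_per_site * USD_PER_TB
    return tb_per_site, usd_per_site, usd_per_site * LOCATIONS


# Scenario sizes: 25% of 400,000 torrents, then 25% and 0.25% of 90,000 titles.
scenarios = {
    "top 25% of 400,000 torrents": 100_000,
    "top 25% of a 90,000-title catalog": 22_500,
    "top 0.25% of a 90,000-title catalog": 225,
}
for label, titles in scenarios.items():
    tb, per_site, total = cache_cost(titles)
    print(f"{label}: {tb:,.1f} TB/site, ${per_site:,.0f}/site, ${total / 1e9:,.2f}B total")
```

Under these assumptions the three scenarios come to roughly $40 billion, $9 billion, and $90 million respectively, in line with the rounded figures quoted in the text.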

Conclusion: a useful CDN/caching/peering solution will break any reasonable budget.

Probably “topology-aware P2P” is a more reasonable model, but that is another story to discuss.

Comments

P2P architecture is good for

P2P architecture is good for selective downloads. However, it has limitations in the selection of content. I spent quite a bit of time working on this segment (Stas knows), and for live streaming with a large selection of content, P2P simply wouldn't scale.

BTW, even though Joost promotes its service as P2P, the largest portion of its content is kept in some central depository.

Bringing bandwidth discussion from email


Stas,

This is indeed an interesting discussion for me,
because I had studied this area myself in the past. I
have read your blog, and it seems there are several
concerns that you did not address.

The most important concern with respect to P2P video
distribution, in my view, is the limited bandwidth
that most ISPs provision to the "upstream" side.
Both DSL and DOCSIS typically allow less than 1 Mbps
from the customer into the network. Assuming a video
stream requirement of not less than 1 Mbps, the
problem is fairly obvious:

With a typical peer having upload bandwidth of < 512
Kbps, half or less of what a typical video stream may
require, it seems difficult to see how a P2P streaming
architecture might work.

Unless of course you place your bets on the ISPs
increasing the "up" as well as the "down" portion of
their last mile, which would be difficult for the
ISPs to justify financially, and seems as long a shot
as is the edge-cache architecture ...

- Leonid

There will be an attempt to solve this issue à la Joost, based on the fact that most of us do not watch TV 24/7. Therefore, the theory goes, there are always more uploaders than downloaders.

To me, this theory is pregnant with problems (and I thus partially agree with Leonid):

- One needs to persuade the users to keep their PCs always on. Unlike Skype, the benefits of doing this in the case of video are unclear. It is next to impossible to coerce me into doing this by limiting my download otherwise (however, this might work for other segments of the online population).

- TV watching is a highly synchronized mass activity. So bandwidth can be a problem even if everyone is talked into keeping their equipment "always on". Should I say, "Super Bowl"?

- There is a nonzero probability that users will keep their video running 24/7 with sound off if they keep their PCs on.

Dima.

$40B for 11,200 PB, not 140TB

I got confused over the numbers: you consider $40B for 140TB * 80000 locations, which brings the total to 11,200 petabytes, right?

If you shop around, you would probably end up with 1/2 or even 1/4 of this number, but still...

P2P scaling: nasa, blizzard, and netflix

I think I'm reading that some folks agree that if lotsa people keep a PC always on as a P2P node, then P2P for large scale content distribution could work. (I don't disagree, though I'd add that we have the free-rider problem: even if a large number of people do always-on, but a much larger group don't, then the latter people can reduce the group benefit and provide a disincentive to the former group.)

That makes me wonder: what would it take to convince people to leave a PC always on as a P2P node? One proven motivator is brand and service, and here are some examples. I can't say that they are illuminating examples, but maybe food for thought. A lot of people play World of Warcraft, and practically every day they need to download an update to play. And while they play, their PC is a P2P client for distributing updates to other players. I wonder if Netflix customers would be motivated to do always-on as an approach to getting their flix quicker without the USPS. Trust would be a powerful motivator too, both for safety/security (P2P has a bad rep there) and for a "fair deal" (not having your computer maxed as a server for others) and for enforcement of other parameters like disk usage, etc. NASA's SETI program is an existing example of a brand and trust combination where people trust NASA to send compute directives to their PCs to work in idle cycles - sort of the first bot net if you think about it from today's perspective.

One last note, though not P2P: I have a computer always on in my house to download video for me, my DVR. I have no problem leaving it on all the time because it records stuff for me, and because it is such a cruddy PC that if I power cycle it, it gets confused for hours ;-) If I connected it to the world via Ethernet and broadband, it could easily be a P2P client/server and I'd never know or care. So maybe that's what's needed: an affordable P2P DVR appliance? I don't know - let's ask George at the next meeting.

EJS

P2P and video distribution

As a matter of fact, I do not believe that a conventional PC is a good device for video entertainment; a TV set-top box is.
Probably it is just an issue of external design, but many of today's TV set-top boxes/DVRs are essentially Linux-run PCs enclosed in some "family-room-friendly" box.
For those kinds of "PCs", the issue of downtime is nonexistent: they are ALWAYS ON! So putting a P2P-based video distribution mechanism on this sort of device would be a sound solution for pre-recorded video distribution.
Distribution of real-time content (like news or sports events) is quite a different kind of problem; I do not believe that caching or even conventional P2P can help here. Probably the right solution will be some mix of P2P and IP multicast... Quite an interesting topic to discuss...