Tag Archives: XML

App Store Data Mining Techniques Revealed – Part 2: Scripting App Store XML Downloads

Welcome back. The first article in this series introduced App Store data mining fundamentals, principally that iTunes works essentially like a browser, except that instead of rendering HTML iTunes uses XML data to generate its views.

In part one, we used a proxy as a man in the middle to save a copy of some interesting data from an iTunes session to disk. Using a proxy is handy for ad-hoc data mining tasks. However, for recurring tasks, it’s handier to leave the proxy and iTunes behind and grab the XML directly. This article will show you how.

We’ll modify our earlier example, which calculated the average selling price of the top-grossing apps, so that it automatically pulls down the XML data it needs.


As was the case before, I’ll be using the Charles Proxy to aid my exploration. As you did in part one, with the proxy running, open up iTunes and navigate to the full list of the Top Grossing apps.

Previously, I used Charles’ search capabilities to find the HTTP request that resulted in the XML for the Top Grossing apps. That’d work here too, but this time I’ll take a different tack:

Charles lets you look at the requests in a browsing session in either of two ways: as a directory-like structure (showing folders, sub-folders and files in a Finder-like hierarchy) or as an ordered sequence.

Using the Sequence view and a bit of filtering will quickly get us to the data we’re after. The XML requests to iTunes are served by URLs that contain WebObjects in their paths. Filter on WebObjects, then hunt for the most recent relatively large request, and you’ll find the URL is:

http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36


User Agent

Paste that URL into a browser or fetch it with curl and you’ll get a web page back. The iTunes HTTP servers inspect the request’s User-Agent and return HTML unless the requestor appears to be iTunes. This is what makes it possible to email iTunes URLs (e.g., to your app) and not leave users stranded with a browser trying to render unfamiliar XML.

So, we need to include a User-Agent header identifying ourselves as iTunes. According to Charles, the User-Agent provided by iTunes was:

iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.6.2) AppleWebKit/531.21.8 

We’ll include that in our request.
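As a rough sketch of what that looks like with Ruby’s standard library, the snippet below builds (but does not send) a GET request carrying the iTunes User-Agent. The constant and helper names are my own inventions for illustration; the URL is the one requested by the script in this article.

```ruby
require 'net/http'
require 'uri'

# User-Agent string that Charles reported for iTunes in this article.
ITUNES_USER_AGENT = "iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.6.2) AppleWebKit/531.21.8"

# Build a GET request that identifies itself as iTunes.
# (Hypothetical helper name; nothing is sent over the network here.)
def itunes_request(url)
  request = Net::HTTP::Get.new(URI(url))
  request['User-Agent'] = ITUNES_USER_AGENT
  request
end

request = itunes_request("http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36")
puts request['User-Agent']
```

To actually send it, the request object can be handed to Net::HTTP#request inside a Net::HTTP.start block.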

Putting It All Together

Starting with the code from the previous article in this series, I’ve modified it to make the HTTP request with the correct User-Agent header and run the resulting XML through the parser to pluck out the prices and calculate the ASP. Here’s the result:

#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'
require 'net/http'

Net::HTTP.start('ax.itunes.apple.com', 80) do |http|
  # Identify ourselves as iTunes so the server returns XML rather than HTML.
  response = http.get('/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36',
                      "User-Agent" => "iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.6.2) AppleWebKit/531.21.8")
  doc = Hpricot(response.body)

  # Each price sits in a <b> inside a basic11-styled TextView; drop the leading "$".
  total = 0.0
  doc.search("//textview[@styleset=\"basic11\"]/setfontstyle/b").each do |i|
    total += i.inner_text[1..-1].to_f
  end

  # Dividing by 100 assumes the chart lists exactly 100 apps.
  puts "Top Grossing Apps' ASP: $#{total / 100.0}"
end

Of note to a few: iTunes’ servers used to always serve the content up gzipped, requiring an additional step to gunzip it before it could be used. I forgot to include the gunzip code and, when I stopped to think about it, was surprised to see that the script worked without it.
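If a response does arrive gzipped, a guard along these lines could be slotted in before parsing. This is only a sketch using Ruby’s Zlib: the helper name is hypothetical, and checking the gzip magic bytes instead of the Content-Encoding header is my assumption, made to keep the example self-contained.

```ruby
require 'zlib'
require 'stringio'

# The first two bytes of every gzip stream.
GZIP_MAGIC = "\x1f\x8b".b

# Return the body unchanged unless it starts with the gzip magic bytes,
# in which case return the decompressed content.
# (Hypothetical helper; not part of the original script.)
def maybe_gunzip(body)
  bytes = body.b
  return body unless bytes.start_with?(GZIP_MAGIC)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

compressed = Zlib.gzip("<Document/>")   # stand-in for a gzipped response body
puts maybe_gunzip(compressed)
puts maybe_gunzip("<Document/>")
```

Because plain bodies pass through untouched, the same call is safe whether or not the server compresses.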

( Source :www.mobileorchard.com )

App Store Data Mining Techniques Revealed – Part 1

The App Store is a treasure trove of data. App Store data can help you pick a category/segment, track trends, find the right price point, chart the total number of apps, track the rate of app approval and much more.

App Store data mining isn’t magic. It’s about finding data that’s exposed in iTunes, extracting it in a machine parse-able format, and doing something with it. This article will demonstrate each of those steps; further articles will expand on this topic.

As is almost always the case, this is best explained with an example. I’ll use a straightforward one: let’s calculate the average selling price (ASP) for the top grossing apps.


The App Store in iTunes contains, in its various views, the superset of the available data. The first step is to find places inside the iTunes App Store that expose the data you want to crunch.

Finding a place in iTunes that presents data that we’ll use to calculate the ASP for the top grossing apps is straightforward:

From the App Store home screen, clicking See All in the Top Grossing segment of the Top Charts panel shows the full list of top grossing apps and their prices.


The iTunes store works like a browser. It uses HTTP like a browser, but instead of parsing HTML it consumes XML.

When iTunes shows the page of Top Grossing apps, each app is represented by a block of XML that contains the title, price, update date, a link to an icon, etc.

We’ll use this XML to calculate the ASP for the Top Grossing apps, but first we need to grab the XML and write it to disk.


The easiest way to get iTunes XML data is to use a proxy as a man in the middle. Put a proxy in that can also write what it sees to disk and you’ll be in business.

I highly recommend, and for this article will be using, Karl von Randow’s Charles Proxy. Charles is designed for exactly this kind of task: as you use iTunes (or a browser) it records all of the headers and content that pass through it and provides you with tools to manipulate, search, filter, display and export the data.

The alternative to using the Charles proxy is to roll your own. Before becoming a Charles convert I wrote a proxy in Ruby. I was only interested in the XML data, so I wrote code to filter on content type. Then I needed to decompress the gzipped content, so I wrote code for that. Then I wrote code to name the files in a useful way. Etc.
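To give a flavor of the bookkeeping involved, here is a small sketch of two of those pieces: deciding whether a response is XML and giving the saved file a useful name. Both helper names and the naming scheme are made up for this illustration and are not the code from my original proxy.

```ruby
require 'uri'

# True when a Content-Type header value looks like XML.
# (Hypothetical helper for illustration.)
def xml_content_type?(content_type)
  !content_type.nil? && content_type.downcase.include?('xml')
end

# Derive a readable file name from the request URL and a sequence number.
# (The naming scheme is an assumption, not a convention from iTunes or Charles.)
def filename_for(url, sequence)
  uri = URI(url)
  action = File.basename(uri.path)
  action = 'index' if action.empty? || action == '/'
  format('%04d-%s-%s.xml', sequence, uri.host, action)
end

puts xml_content_type?('text/xml; charset=UTF-8')
puts filename_for('http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36', 7)
```

A real proxy would still need the listening socket, request forwarding and decompression around these helpers, which is where most of the effort goes.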

Charles is $50. Easily worth it vs. the code you’ll have to write to do this yourself, especially when you’re in the exploring phase where Charles lets you quickly pin down exactly where the data you’re looking for came across the wire. Needless to say, I’ve no commercial interest in Charles; I’m recommending it on its merits. You can see for yourself with a free, 30-day trial.

Locating And Saving The Data

Fire up your proxy and open iTunes. Charles automatically configures OS X to place itself inline as a proxy after you grant it permission to do so.

In iTunes, click the iTunes Store item in the left column, select the App Store from the iTunes Store’s top menu bar, and click See All in the Top Grossing segment of the Top Charts panel.

Take a look at what’s crossed the wire: iTunes makes HTTP requests to a number of different hosts. The two that’ll likely be of most interest are: a1.phobos.apple.com and ax.itunes.apple.com. The former serves app icon images and the latter serves what we’re after: XML data.
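If you export the list of requested URLs, the same triage can be scripted. The sketch below keeps only the requests to the XML host whose paths contain WebObjects (the filter described in Part 2). The helper name is my own, and the icon URL in the sample list is a made-up placeholder.

```ruby
require 'uri'

# Host that serves the XML data, per the session observed in Charles.
XML_HOST = 'ax.itunes.apple.com'

# Keep only URLs on the XML host whose path mentions WebObjects.
# (Hypothetical helper for illustration.)
def xml_requests(urls)
  urls.select do |url|
    uri = URI(url)
    uri.host == XML_HOST && uri.path.include?('WebObjects')
  end
end

recorded = [
  'http://a1.phobos.apple.com/example-icon.jpg', # hypothetical icon URL
  'http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewTopLegacy?id=25204&popId=38&genreId=36'
]
puts xml_requests(recorded)
```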

We’re interested in the XML data that iTunes used to render the Top Grossing screen. Finding the right file amongst the lot of them is most easily accomplished by searching for some bit of text that’ll only show up in the file we’re after.

Best bet here is to pick the title of one of the apps at the tail end of the list: those aren’t likely to be featured and won’t show up in the top-10 list that’s on the App Store’s top level page. Charles makes quick work of searching: Command-F to bring up the search dialog, enter the text to find, search across all the files by choosing the Session scope and click Find.

If you picked your search term wisely it’ll show up several times in one file. Double click any row in the search results to view the item. Right-click or Control-click inside the content pane, choose Save Response… and store the results to disk.
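The search step can be approximated outside Charles too. Assuming you have already saved several response bodies, this sketch counts how often a search term appears in each and reports the best candidate. The helper name, file names and sample contents are all invented for the illustration.

```ruby
# Count how often `term` appears in each saved response body and return
# the name of the body with the most matches (nil if nothing matches).
# (Hypothetical helper for illustration.)
def best_match(bodies, term)
  name, count = bodies.map { |n, body| [n, body.scan(term).length] }
                      .max_by { |_, c| c }
  count && count > 0 ? name : nil
end

# Made-up stand-ins for saved responses.
bodies = {
  'storefront.xml'   => '<b>Example App</b>',
  'top-grossing.xml' => '<b>Example App</b><b>Tail End App</b> alt="Tail End App"'
}
puts best_match(bodies, 'Tail End App')
```

A term that appears in only one file, and several times within it, is the signal described above.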

Manipulating The Data

The XML data is large — 22,000 lines for our sample — and hard to comprehend.

Rather than trying to fully understand its format, simplify things by searching for prices.

Prices show up in a number of places, e.g., in the alt-attributes for some images. Parsing them out of those isn’t ideal. However, prices show up almost unadorned nested in this structure:

<TextView topInset="0" truncation="right" leftInset="0" styleSet="basic11" textJust="left" maxLines="1">
  <SetFontStyle normalStyle="matrixTextFontStyle">
    <b>$9.99</b>
  </SetFontStyle>
</TextView>

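XPath exists to make it easy to pluck values out of structures in XML. This XPath search, shown in the lowercased form the script below uses, will pluck the prices out of our document:

//textview[@styleset="basic11"]/setfontstyle/b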
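To see the XPath idea in isolation, here is a sketch that runs an equivalent query against the fragment above using REXML rather than Hpricot (my substitution, so the example depends only on a library that normally ships with Ruby). REXML matches names case-sensitively, so the query uses the mixed-case names from the XML. The wrapping Document element and the second TextView are made up to give the query something to skip.

```ruby
require 'rexml/document'

# The price fragment from above plus a made-up non-matching TextView,
# wrapped in a hypothetical root element.
SAMPLE = <<~XML
  <Document>
    <TextView topInset="0" truncation="right" leftInset="0" styleSet="basic11" textJust="left" maxLines="1">
      <SetFontStyle normalStyle="matrixTextFontStyle">
        <b>$9.99</b>
      </SetFontStyle>
    </TextView>
    <TextView styleSet="other">
      <SetFontStyle normalStyle="matrixTextFontStyle"><b>Not a price</b></SetFontStyle>
    </TextView>
  </Document>
XML

# Return the text of every <b> nested in a basic11-styled TextView.
# (Hypothetical helper for illustration.)
def prices_in(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//TextView[@styleSet='basic11']/SetFontStyle/b")
              .map { |node| node.text.strip }
end

puts prices_in(SAMPLE)
```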


Using Ruby’s Hpricot library, this short script does the work of grabbing each price, removing the leading dollar sign and then calculating the average price:

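#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'

# Parse the XML that was saved from Charles.
doc = Hpricot(File.read("topgrossing.xml"))

# Each price sits in a <b> inside a basic11-styled TextView; drop the leading "$".
total = 0.0
doc.search("//textview[@styleset=\"basic11\"]/setfontstyle/b").each do |i|
  total += i.inner_text[1..-1].to_f
end

# Dividing by 100 assumes the chart lists exactly 100 apps.
puts "Top Grossing Apps' ASP: $#{total / 100.0}"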

And we’ve arrived at our goal!
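One caveat: the script divides by a hard-coded 100, which only holds if the list contains exactly 100 priced apps. A slightly more defensive version of the arithmetic, sketched below with made-up sample values and helper names of my own, divides by the number of prices actually found and skips anything that isn’t a dollar amount.

```ruby
# Convert strings such as "$9.99" to floats, ignoring anything that
# is not a plain dollar amount. (Hypothetical helper for illustration.)
def parse_prices(texts)
  texts.map { |t| t[/\A\$(\d+(?:\.\d+)?)\z/, 1] }.compact.map(&:to_f)
end

# Average of the parsed prices, or nil when no prices were found.
def average_price(texts)
  prices = parse_prices(texts)
  return nil if prices.empty?
  prices.sum / prices.length
end

sample = ['$9.99', '$0.99', 'Free', '$4.99'] # made-up values
puts format('ASP: $%.2f', average_price(sample))
```

The strings passed in would be the inner_text values collected by the XPath search.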

More To Come

This example is straightforward. All of the data is contained in the response to one HTTP request. In a future post I’ll talk about techniques for scripting a series of requests to gather pieces of a complete data set. Stay tuned!

( Source : mobileorchard.com )