Resist Data Integration Redundancy


The Internet makes massive amounts of data available for lots of interesting applications. But whenever you design a unique analysis and presentation of information you don’t privately control, you risk that the owner will offer the same view at some point in the future, instantly making your application redundant.

That’s exactly what happened to the Groupon API data-mining project we originally wrote about in August 2011. Fortunately, the core of our project is a MapForce graphical data mapping, so we can quickly tweak the mapping and repurpose it to present an entirely different data set that provides new value.

HTML output from MapForce and StyleVision

Our project began when we noticed that some Groupon deals offered in only a few individual locations could actually be redeemed over the Web for physical merchandise that ships almost anywhere.

We used MapForce to query the Groupon API for all offers from every Groupon location, filtered the results to keep only the offers classified as online deals, and presented them in an HTML page elegantly formatted by Altova StyleVision for desktop and mobile devices.

The new Goods tab recently added to the top of the Groupon Web page makes our original data mapping completely redundant, since it offers immediate access to items for sale online from many locations.

Groupon Menu Bar

Even worse, since most of the same goods are offered in nearly every Groupon location, our mapping output now generates dozens of duplicates.

Repurpose the Application

Thinking at a more meta level, the justification for our original project is still valid: Groupon organizes and displays deals based on a geographical query, but there are instances where a deal is more attractive than the location. For example, a trip to Allentown, Pennsylvania might not be on your bucket list, but what if you knew about a Groupon deal to drive a Ferrari, Lamborghini, or Aston Martin five or ten laps around the Pocono Racetrack for half the usual price?

The Ferrari offer isn’t an Online Deal, so it’s not listed under the Goods tab, nor is it selected by our MapForce data mapping. As a new target for our mapping design, let’s collect all Groupon offers from all locations that are NOT classified as Online Deals. There are probably lots of interesting things to do in places that might not immediately spring to mind. Here’s the section of our original data mapping that filtered the data response from the API to select Online Deals:

MapForce mapping selects Online deals

The contains function in the center of the screenshot checks to see if an element called <redemptionLocation> in the deal description contains the word Online, indicating an online deal. The deal is passed along for further processing only if the result is true.

The logical-and function at the top right combines the online-deal test with a test of the element called <isSoldOut> to select only the offers that are still available (<isSoldOut> = false).

We can very easily reverse the set of collected data by inserting a logical-not function after the contains function. The new mapping selects all deals that do NOT contain Online in the <redemptionLocation> element.
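For readers who think in code rather than mapping diagrams, the same selection logic can be sketched in a few lines of Python. This is only an illustration of the contains, logical-not, and logical-and steps, not what MapForce generates. The element names <redemptionLocation>, <isSoldOut>, and <title> come from the mapping described above; the surrounding <deals>/<deal> structure, the sample values, and the function name select_deals are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Only the element names <title>,
# <redemptionLocation> and <isSoldOut> are taken from the article;
# the <deals>/<deal> wrapper and all values are invented.
SAMPLE = """
<deals>
  <deal>
    <title>Phone case</title>
    <redemptionLocation>Online Deal</redemptionLocation>
    <isSoldOut>false</isSoldOut>
  </deal>
  <deal>
    <title>Supercar laps</title>
    <redemptionLocation>Pocono Racetrack</redemptionLocation>
    <isSoldOut>false</isSoldOut>
  </deal>
  <deal>
    <title>Spa day</title>
    <redemptionLocation>Downtown</redemptionLocation>
    <isSoldOut>true</isSoldOut>
  </deal>
  <deal>
    <title>Headphones</title>
    <redemptionLocation>Online Deal</redemptionLocation>
    <isSoldOut>true</isSoldOut>
  </deal>
</deals>
"""


def select_deals(root, want_online):
    """Return available deals that are (or are not) online deals."""
    selected = []
    for deal in root.iter("deal"):
        location = deal.findtext("redemptionLocation", default="")
        matches = "Online" in location      # the contains function
        if not want_online:
            matches = not matches           # the inserted logical-not
        available = deal.findtext("isSoldOut", default="").strip() == "false"
        if matches and available:           # the logical-and
            selected.append(deal)
    return selected


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for deal in select_deals(root, want_online=False):
        print(deal.findtext("title"))
```

Passing want_online=True reproduces the original Online selection, and want_online=False reproduces the reversed one. In both cases sold-out deals are dropped.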

MapForce mapping selects NOT Online deals

Since the structure of the data does not change, only the content, we don’t have to do anything else before executing the new version of the mapping. Here is a portion of the XML output showing the Ferrari deal:

Portion of MapForce XML Output

We could take this output file and immediately process it through StyleVision using our original stylesheet to create an HTML document, but while we’re in MapForce, let’s add two more enhancements.

Remove Duplicate Data

We are still getting some duplicates in the new results because the same deals are frequently offered in multiple neighborhoods of large cities. One of the samples installed with MapForce is a mapping called DistinctArticles.mfd that demonstrates how to remove duplicates from an input stream where XML nodes contain repeated data.

We can easily copy the design from the example to our Groupon mapping:

MapForce removes XML nodes with duplicate content

The <title> element functions as a unique key to identify duplicate deals, and the compute-when variable sends only the first copy for further processing.
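The effect of this de-duplication step can be approximated outside MapForce with a "keep the first occurrence per key" loop. The Python sketch below is an assumption-laden illustration, not a description of how DistinctArticles.mfd works internally: deals are represented as plain dictionaries, and the "division" key and the function name first_by_title are hypothetical. Only the use of the title as the unique key comes from the mapping.

```python
def first_by_title(deals):
    """Keep only the first deal seen for each title, preserving order."""
    seen_titles = set()
    unique = []
    for deal in deals:
        title = deal["title"]          # the title acts as the unique key
        if title not in seen_titles:
            seen_titles.add(title)
            unique.append(deal)        # only the first copy is passed on
    return unique


if __name__ == "__main__":
    # Hypothetical records: the same deal offered in two neighborhoods.
    deals = [
        {"division": "Manhattan", "title": "Supercar laps"},
        {"division": "Brooklyn", "title": "Supercar laps"},
        {"division": "Brooklyn", "title": "Pizza tasting"},
    ]
    for deal in first_by_title(deals):
        print(deal["division"], "-", deal["title"])
```

Because the first copy wins, the location attached to a surviving deal depends on the order in which the input arrives.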

Of course we can also apply this de-duplication strategy to the original mapping of Online deals, to find out if the Goods tab for one location really offers all the Online deals out there. (It doesn’t.)

Data Sorting

A new feature added to MapForce 2012 Release 2 allows us to sort the data before it reaches the output file. Here is the section of the mapping that sorts first by division names, which are the Groupon locations for deals, then by the deal title within each location.
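As a rough equivalent of this two-level sort, a composite sort key does the same job in Python. The sketch assumes each deal is a dictionary with hypothetical "division" and "title" keys; it illustrates the ordering only and says nothing about how the MapForce sort component is implemented.

```python
def sort_deals(deals):
    """Sort by division name first, then by deal title within each division."""
    return sorted(deals, key=lambda deal: (deal["division"], deal["title"]))


if __name__ == "__main__":
    # Hypothetical records for illustration.
    deals = [
        {"division": "Boston", "title": "Whale watch"},
        {"division": "Allentown", "title": "Supercar laps"},
        {"division": "Boston", "title": "Duck tour"},
    ]
    for deal in sort_deals(deals):
        print(deal["division"], "-", deal["title"])
```

A tuple key compares the division names first and falls back to the title only when two deals share a division, which matches the "first by location, then by title" order described above.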

Data sorting in MapForce

Now we can process the completed mapping and generate an HTML document by transforming the XML output file with our original StyleVision stylesheet:

HTML output transformed by StyleVision

Maybe we can even get a deal on a tasty Italian snack after driving a fast Italian car! MapForce and StyleVision are available together in the specially priced Altova MissionKit. See for yourself how easy it is to use the MissionKit to integrate data from a Web API: download a free 30-day trial!
