Analytics That Aren’t: Why I’m Not Excited about SAP in Tennis

It’s not analytics, it’s marketing.

The Grand Slams (with IBM) and now the WTA (with SAP) are claiming to deliver powerful analytics to tennis fans.  And it’s certainly true that IBM and SAP collect way more data than the tours would without them.  But what happens to that data?  What analytics do fans actually get?

Based on our experience after several years of IBM working with the Slams and Hawkeye operating at top tournaments, the answers aren’t very promising.  IBM tracks lots of interesting stats, makes some shiny graphs available during matches, and the end result of all this is … Keys to the Match?

Once matches are over and the performance of the Keys to the Match is (blessedly) forgotten, all that data goes into a black hole.

Here’s the message: IBM collects the data. IBM analyzes the data. IBM owns the data. IBM plasters its logo and its “Big Data” slogans all over anything that contains any part of the data. The tournaments and tours are complicit in this: IBM signs a big contract, makes its analytics part of its marketing, and the tournaments and tours consider it a big step forward for tennis analysis.

Sometimes, marketing-driven analytics can be fun.  They give some fans what they want–counts of forehand winners, or average first-serve speeds. But let’s not fool ourselves. What IBM offers isn’t advancing our knowledge of tennis. In fact, it may be strengthening the same false beliefs that analytical work should be correcting.

SAP: Same Story (So Far)

Early evidence suggests that SAP, in its partnership with the WTA, will follow exactly the same model:

SAP will provide the media with insightful and easily consumable post-match notes which offer point-by-point analysis via a simple point tracker, highlight key events in the match, and compare previous head-to-head and 2013 season performance statistics.

“Easily consumable” is code for “we decide what the narratives are, and we come up with numbers to amplify those narratives.”

Narrative-driven analytics are just as bad as–and perhaps more insidious than–marketing-driven analytics, which are simply useless.  The amount of raw data generated in a tennis match is enormous, which is why TV broadcasts give us the same small tidbits of Hawkeye data: distance run during a point, average rally hit point, and so on.  So, under the weight of all those possibilities, why not just find the numbers that support the prevailing narrative? The media will cite those numbers, the fans will feel edified, and SAP will get its name dropped all over the place.

What we’re missing here is context.  Take this SAP-generated stat from a writeup on the WTA site:

The first promising sign for Sharapova against Kanepi was her rally hit point. Sharapova made contact with the ball 76% of the time behind the baseline compared to 89% for her opponent. It doesn’t matter so much what the percentage is – only that it is better than the person standing on the other side of the net.

Is that actually true? I don’t think anyone has ever published any research on whether rally hit point correlates with winning, though it seems sensible enough. In any case, these numbers are crying out for more context.  Is 76% good for Maria? How about keeping her opponent behind the baseline 89% of the time? Is the gap between 76% and 89% particularly large on the WTA? Does Maria’s rally hit point in one match tell us anything about her likely rally hit point in her next match?  After all, the article purports to offer “keys to match” for Maria against her next opponent, Serena Williams.

Here’s another one:

There is a lot to be said for winning the first point of your own service game and that rung true for Sharapova in her quarterfinal. When she won the opening point in 11 of her service games she went on to win nine of those games.

Is there any evidence that winning your first point is more valuable than, say, winning your second point?  Does Sharapova typically have a tough time winning her opening service point?  Is Kanepi a notably difficult returner on the deuce side, or early in games?  “There is a lot to be said” means, roughly, that “we hear this claim a lot, and SAP generated this stat.”
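One way to supply the missing baseline is a simple model. The sketch below assumes every point on serve is won independently with the same probability p, which is a strong simplification, and the value 0.6 is made up for illustration rather than taken from the SAP writeup. The function name `hold_prob` is likewise hypothetical. It compares the chance of holding serve after winning the first point with the chance of holding after winning the second point.

```python
from functools import lru_cache

def hold_prob(p, server_pts=0, returner_pts=0):
    """Chance the server wins the game from the given point score,
    assuming each point is won independently with probability p."""
    q = 1 - p

    @lru_cache(maxsize=None)
    def win(s, r):
        if s >= 4 and s - r >= 2:
            return 1.0  # server has won the game
        if r >= 4 and r - s >= 2:
            return 0.0  # returner has broken serve
        if s == r and s >= 3:
            return p * p / (p * p + q * q)  # closed form from deuce
        return p * win(s + 1, r) + q * win(s, r + 1)

    return win(server_pts, returner_pts)

p = 0.6  # hypothetical serve-point win rate, chosen only for illustration
baseline = hold_prob(p)
# Won the first point: the score is 15-0.
given_first = hold_prob(p, 1, 0)
# Won the second point: the score is 30-0 or 15-15, depending on the first point.
given_second = p * hold_prob(p, 2, 0) + (1 - p) * hold_prob(p, 1, 1)
print(round(baseline, 3), round(given_first, 3), round(given_second, 3))
```

Under this assumption the two conditional probabilities are identical, so the model gives the first point no special status. Any real first-point effect would have to show up as a departure from this baseline in actual point-by-point data, which is exactly the context a figure like "nine of 11" lacks.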

In any type of analytical work, context is everything.  Narrative-driven analytics strip out all context.

The alternative

IBM, SAP, and Hawkeye are tracking a huge amount of tennis data.  For the most part, the raw data is inaccessible to researchers.  The outsiders who are most likely to provide the context that tennis stats so desperately need just don’t have the tools to evaluate these narrative-driven offerings.

Other sporting organizations–notably Major League Baseball–make huge amounts of raw data available.  All this data makes fans more engaged, not less. It’s simply another way for the tours to get fans excited about the game. Statheads–and the lovely people who read their blogs–buy tickets too.

So, SAP, how about it?  Make your branded graphics for TV broadcasts. Provide your easily consumable stats for the media.  But while you’re at it, make your raw data available for independent researchers. That’s something we should all be able to get excited about.

US Open Draw Datasets

Earlier today, I published a thorough analysis of the last ten years of US Open draws, showing that while first and second seeds have had extremely easy first-round matchups, there is no other credible statistical evidence that suggests any nonrandom manipulation of the draw.

If you want to take a look at the draws yourself, I’ve made it easier.  The following files not only have the full draws going back to 2001, but they also include each player’s ATP or WTA ranking at the time of the tournament, their ordinal ranking among the players in the draw, the ordinal ranking of their first-round opponent, and the ordinal ranking of their best possible second-round opponent.

Click to download the files:

Here’s a quick rundown of the columns you’ll find in each sheet:

  • Year — each file contains the entire draws for the last ten years.
  • Draw Pos[ition] — numbers 1 to 128, so you can always sort the sheet to show the players in draw order.  (For instance, the #1 seed is 1, that player’s opponent is 2, and so on.)
  • Player
  • Country
  • Seed — the seeding assigned by the US Open
  • Rank [ATP/WTA] — the player’s official ranking the Monday that the tourney began.
  • Ordinal — the player’s rank among the 128 players in the field.  Last year, Shelby Rogers’s WTA ranking was 344, which made her ordinal ranking 124 out of 128.
  • 1stRdOpp — the ordinal ranking of the player’s first-round opponent.
  • Best2nd — the ordinal ranking of the player’s best possible second-round opponent.
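
To make the derived columns concrete, here is a minimal sketch of how Ordinal, 1stRdOpp, and Best2nd could be computed from draw position and ranking. It assumes the standard bracket layout: positions 1 and 2 meet in the first round, and the winner faces whoever emerges from positions 3 and 4. The function name, the dictionary keys, and the four-player toy field are made up for illustration, and ties or unranked players are not handled. It is not the code used to build the files.

```python
def add_draw_columns(draw):
    """draw: list of dicts with 'pos' (1-based draw position),
    'player', and 'rank' (official ranking). Returns new rows with
    the derived columns added."""
    # Ordinal: 1 for the best-ranked player in the field, and so on down.
    by_rank = sorted(draw, key=lambda row: row["rank"])
    ordinal = {row["player"]: i for i, row in enumerate(by_rank, start=1)}
    by_pos = {row["pos"]: row for row in draw}

    rows = []
    for row in draw:
        pos = row["pos"]
        # First-round pairs are positions (1, 2), (3, 4), ...
        opp_pos = pos + 1 if pos % 2 == 1 else pos - 1
        # The second-round opponent comes from the other pair
        # in the same block of four positions.
        block_start = (pos - 1) // 4 * 4 + 1
        other_pair = [q for q in range(block_start, block_start + 4)
                      if q not in (pos, opp_pos)]
        rows.append({
            **row,
            "Ordinal": ordinal[row["player"]],
            "1stRdOpp": ordinal[by_pos[opp_pos]["player"]],
            "Best2nd": min(ordinal[by_pos[q]["player"]] for q in other_pair),
        })
    return rows

# Toy four-player field with invented players and rankings.
toy_draw = [
    {"pos": 1, "player": "A", "rank": 1},
    {"pos": 2, "player": "B", "rank": 344},
    {"pos": 3, "player": "C", "rank": 50},
    {"pos": 4, "player": "D", "rank": 80},
]
result = add_draw_columns(toy_draw)
for row in result:
    print(row["pos"], row["player"], row["Ordinal"], row["1stRdOpp"], row["Best2nd"])
```

In the toy field, player A has the top ordinal, draws the lowest-ranked player in the first round, and would meet the second-best player (C) at the earliest in round two. The same logic applied to 128 positions gives the columns described above.
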
Let me know if you find anything interesting!
