Category Archives: Data

Match Charting Project: More Matches, More Data, New Spreadsheet

The Match Charting Project keeps growing, and starting today, even more of the data is available for anyone who wants it. Several new contributors have helped us pass the 750-match milestone, having added an average of two matches per day since I first published the raw data.

New spreadsheet

The Match Charting spreadsheet now does a lot more. As you chart each point, the document updates stats for the match, both in total and set by set. You’ll find the same stats you see on television (aces, double faults, winners, unforced errors, etc.) along with some that are a little less common, like winning percentage in rallies of different lengths, and most consecutive points won.

In other words, as you chart the match, you’ll have access to many of the same stats that commentators do. Here’s what it looks like:

[Screenshot: the MatchChart spreadsheet showing match stats]

If you’ve hesitated to try charting because you couldn’t see what was in it for you, I hope this changes the calculation a bit.

Click here to download MatchChart 0.1.4.

(If you prefer to use the lighter-weight version 0.1.2, that’s fine too.)

New data

About a month ago, I published the point-by-point data from all charted matches.  In raw form, it’s a bit daunting, and it’s more than what’s necessary for many interesting research projects.

Today, I added 15 different aggregate stats files for men, and another 15 for women. These contain the data that is shown in each charted match report. For instance, if you find it interesting that Simona Halep hit 14% of her backhands down the line in the Indian Wells final, you can take a look in the ShotDirection stats file and compare that number with the results from Halep’s other charted matches, or all matches in the database as a whole.

You can find these files (along with the updated raw data for 760+ matches) by clicking here.
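
For anyone who wants to try that kind of comparison, here is a minimal sketch in Python. The field names and every number in `rows` are invented for illustration and are not taken from the real ShotDirection file, so check the actual column headers before adapting it. It pools shot counts rather than averaging per-match percentages, so that short matches don't count as much as long ones.

```python
from collections import defaultdict

# Illustrative rows only: field names and numbers are made up for this sketch
# and do not come from the real ShotDirection file.
rows = [
    {"player": "Player A", "match": "m1", "backhands": 100, "bh_down_the_line": 14},
    {"player": "Player A", "match": "m2", "backhands": 80, "bh_down_the_line": 8},
    {"player": "Player B", "match": "m1", "backhands": 120, "bh_down_the_line": 30},
]


def dtl_rate(records):
    """Share of backhands hit down the line, pooled over the given records."""
    total = sum(r["backhands"] for r in records)
    dtl = sum(r["bh_down_the_line"] for r in records)
    return dtl / total if total else float("nan")


# Group rows by player so one player's matches can be pooled together.
by_player = defaultdict(list)
for r in rows:
    by_player[r["player"]].append(r)

single_match = dtl_rate([rows[0]])                # one charted match
player_overall = dtl_rate(by_player["Player A"])  # all of that player's matches
database = dtl_rate(rows)                         # every row in the file

print(f"{single_match:.1%} in this match, {player_overall:.1%} for the player, "
      f"{database:.1%} across the database")
```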

Chart some matches

If you haven’t already, now is a great time to start charting professional matches and contributing to the project. An enormous number of matches are televised and streamed, and as the database of charted matches grows, there’s more and more useful context to all the data we’re generating.

You can start by jumping into the ‘Instructions’ tab of the new MatchChart spreadsheet, or for other tips, you can start with my blog post introducing the project.

Filed under Data, Match charting

Free ATP and WTA Results and Stats Databases

The vast majority of my men’s and women’s tennis results and stats databases are now free for anyone who wants to use them.

ATP Results and Stats:

  • Tour-level results back to 1968, with tons of data on both players in each match (age, handedness, country, rank), and matchstats from 1991-present.
  • Almost a decade of tour-level qualifying matches, with matchstats for the last few years.
  • Challenger results back to 1991, with matchstats for almost the last ten years.
  • Futures (and Satellite) results back to 1991.
  • Linked biographical and rankings data (introduced here).

WTA Results:

  • Tour-level results back to 1968, with the same player data as in the ATP files.
  • Tour-level qualifying matches.
  • Over 220,000 ITF main-draw matches.

Click the links to access the files. Enjoy!

Filed under Data

Free ATP and WTA Ranking Databases

More data!

Today I’ve made available my entire ATP and WTA ranking databases through the end of the 2014 season. In addition, you’ll find my complete player tables, which include birthdate, country, and handedness for every player who has ever been ranked or played a tour-level match. (Plus thousands more players, who are included in the database for other reasons.)

This is all the data you need to research all sorts of topics, like the rise and fall of certain countries in the rankings and the changing age of top 10s, 50s, and 100s.
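
As one illustration of the age question, here is a rough Python sketch that averages the age of the top N players on a given ranking date. The record layout (`player_id`, `birthdate`, `ranking_date`, `rank`) and all of the sample values are assumptions made for this example, not the actual column names or contents of the published files.

```python
from datetime import date
from statistics import mean

# Hypothetical player table: ids, names, and birthdates are made up.
players = {
    101: {"name": "Player A", "birthdate": date(1984, 12, 29)},
    102: {"name": "Player B", "birthdate": date(1994, 12, 29)},
    103: {"name": "Player C", "birthdate": date(1974, 12, 29)},
}

# Hypothetical ranking rows for a single ranking date.
rankings = [
    {"ranking_date": date(2014, 12, 29), "rank": 1, "player_id": 101},
    {"ranking_date": date(2014, 12, 29), "rank": 2, "player_id": 102},
    {"ranking_date": date(2014, 12, 29), "rank": 3, "player_id": 103},
]


def age_on(birthdate, day):
    """Approximate age in years on the given day."""
    return (day - birthdate).days / 365.25


def average_age_of_top(n, day):
    """Mean age of players ranked n or better on the given ranking date."""
    ages = [
        age_on(players[r["player_id"]]["birthdate"], day)
        for r in rankings
        if r["ranking_date"] == day and r["rank"] <= n
    ]
    return mean(ages) if ages else None


print(average_age_of_top(2, date(2014, 12, 29)))
```

Running the same function over every ranking date in a season, or over many seasons, gives the trend line for top 10s, 50s, and 100s.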

This is the third major dataset I’ve published this week, and more is on the way.

ATP rankings are here, and WTA rankings are here. Enjoy!

Filed under Data

Raw Data From The Match Charting Project

In the last year and a half, dozens of contributors and I have amassed detailed shot-by-shot records of nearly 700 professional matches. You can see the full list here, or a menu sorted by player here.

I refer to this as The Match Charting Project, and I hope you’ll consider contributing as well. Using a straightforward text notation system, we record shot type, shot direction, return depth, error types, and more. The more matches, the more interesting the results. The project made up part of my presentation at the Sloan Sports Analytics Conference last month, which included some very preliminary findings on player tendencies.

Now, you can dig into the raw data yourself. I’ve posted all of the user-submitted match charts in one place, in a standardized format for anyone who wants to mess around with it.

Enjoy!

Filed under Data, Match charting

Point-by-Point Data From the Last 17 Grand Slams

I’ve been doing a lot of griping lately about the state of tennis data, so I figured now was a good time to start doing something about it.

I’ve just released point-by-point data for most Grand Slam singles matches back to 2011. Beyond the basic point sequence–which is valuable in and of itself–you’ll find serve speed, winner type, and for a few of the slams, rally length for each point.

More detailed notes on the data are available at that link. Enjoy, and if working with it turns up any interesting findings, please let me know.

Filed under Data

Analytics That Aren’t: Why I’m Not Excited about SAP in Tennis

It’s not analytics, it’s marketing.

The Grand Slams (with IBM) and now the WTA (with SAP) are claiming to deliver powerful analytics to tennis fans.  And it’s certainly true that IBM and SAP collect way more data than the tours would without them.  But what happens to that data?  What analytics do fans actually get?

Based on our experience after several years of IBM working with the Slams and Hawkeye operating at top tournaments, the answers aren’t very promising.  IBM tracks lots of interesting stats, makes some shiny graphs available during matches, and the end result of all this is … Keys to the Match?

Once matches are over and the performance of the Keys to the Match is (blessedly) forgotten, all that data goes into a black hole.

Here’s the message: IBM collects the data. IBM analyzes the data. IBM owns the data. IBM plasters their logo and their “Big Data” slogans all over anything that contains any part of the data. The tournaments and tours are complicit in this: IBM signs a big contract, makes their analytics part of their marketing, and the tournaments and tours consider it a big step forward for tennis analysis.

Sometimes, marketing-driven analytics can be fun. They give some fans what they want: counts of forehand winners, or average first-serve speeds. But let’s not fool ourselves. What IBM offers isn’t advancing our knowledge of tennis. In fact, it may be strengthening the same false beliefs that analytical work should be correcting.

SAP: Same Story (So Far)

Early evidence suggests that SAP, in its partnership with the WTA, will follow exactly the same model:

SAP will provide the media with insightful and easily consumable post-match notes which offer point-by-point analysis via a simple point tracker, highlight key events in the match, and compare previous head-to-head and 2013 season performance statistics.

“Easily consumable” is code for “we decide what the narratives are, and we come up with numbers to amplify those narratives.”

Narrative-driven analytics are just as bad as marketing-driven analytics, which are simply useless, and perhaps more insidious. The amount of raw data generated in a tennis match is enormous, which is why TV broadcasts give us the same small tidbits of Hawkeye data: distance run during a point, average rally hit point, and so on. So, under the weight of all those possibilities, why not just find the numbers that support the prevailing narrative? The media will cite those numbers, the fans will feel edified, and SAP will get its name dropped all over the place.

What we’re missing here is context.  Take this SAP-generated stat from a writeup on the WTA site:

The first promising sign for Sharapova against Kanepi was her rally hit point. Sharapova made contact with the ball 76% of the time behind the baseline compared to 89% for her opponent. It doesn’t matter so much what the percentage is – only that it is better than the person standing on the other side of the net.

Is that actually true? I don’t think anyone has ever published any research on whether rally hit point correlates with winning, though it seems sensible enough. In any case, these numbers are crying out for more context.  Is 76% good for Maria? How about keeping her opponent behind the baseline 89% of the time? Is the gap between 76% and 89% particularly large on the WTA? Does Maria’s rally hit point in one match tell us anything about her likely rally hit point in her next match?  After all, the article purports to offer “keys to match” for Maria against her next opponent, Serena Williams.

Here’s another one:

There is a lot to be said for winning the first point of your own service game and that rung true for Sharapova in her quarterfinal. When she won the opening point in 11 of her service games she went on to win nine of those games.

Is there any evidence that winning your first point is more valuable than, say, winning your second point?  Does Sharapova typically have a tough time winning her opening service point?  Is Kanepi a notably difficult returner on the deuce side, or early in games?  “There is a lot to be said” means, roughly, that “we hear this claim a lot, and SAP generated this stat.”
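
With point-by-point data, the first of those questions is easy to check. Nine holds in 11 games is about 82%, but that figure only means something next to a comparison. The sketch below, in Python, computes the hold rate in games where the server won the first point and in games where the server won the second point. The encoding (one string per service game, "S" for a point won by the server and "R" for a point won by the returner) and the six sample games are made up for illustration; they are not real match data.

```python
def hold_rate_when_point_won(games, point_index):
    """Hold rate in service games where the server won the point at
    point_index (0 = first point of the game, 1 = second, ...)."""
    selected = [g for g in games if len(g) > point_index and g[point_index] == "S"]
    if not selected:
        return None
    # In a completed game, the player who wins the last point wins the game.
    held = sum(1 for g in selected if g[-1] == "S")
    return held / len(selected)


# Made-up service games, each a complete game from the server's point of view.
games = ["SSSS", "SRSSS", "RSSSS", "SRRRR", "RSRRR", "RRRR"]

after_first = hold_rate_when_point_won(games, 0)
after_second = hold_rate_when_point_won(games, 1)
print(f"hold rate after winning point 1: {after_first:.0%}")
print(f"hold rate after winning point 2: {after_second:.0%}")
```

If the two rates come out about the same over a large sample, the opening point is nothing special, and the stat is just restating that players who win points tend to win games.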

In any type of analytical work, context is everything.  Narrative-driven analytics strip out all context.

The alternative

IBM, SAP, and Hawkeye are tracking a huge amount of tennis data.  For the most part, the raw data is inaccessible to researchers.  The outsiders who are most likely to provide the context that tennis stats so desperately need just don’t have the tools to evaluate these narrative-driven offerings.

Other sporting organizations–notably Major League Baseball–make huge amounts of raw data available.  All this data makes fans more engaged, not less. It’s simply another way for the tours to get fans excited about the game. Statheads–and the lovely people who read their blogs–buy tickets too.

So, SAP, how about it?  Make your branded graphics for TV broadcasts. Provide your easily consumable stats for the media.  But while you’re at it, make your raw data available for independent researchers. That’s something we should all be able to get excited about.

Filed under Data, Keys to the match, Research

US Open Draw Datasets

Earlier today, I published a thorough analysis of the last ten years of US Open draws, showing that while first and second seeds have had extremely easy first-round matchups, there is no other credible statistical evidence that suggests any nonrandom manipulation of the draw.

If you want to take a look at the draws yourself, I’ve made it easier.  The following files not only have the full draws going back to 2001, but they also include each player’s ATP or WTA ranking at the time of the tournament, their ordinal ranking among the players in the draw, the ordinal ranking of their first-round opponent, and the ordinal ranking of their best-possible second-round opponent.

Click to download the files:

Here’s a quick rundown of the columns you’ll find in each sheet:

  • Year — each file contains the entire draws for the last ten years.
  • Draw Pos[ition] — numbers 1 to 128, so you can always sort the sheet to show the players in draw order.  (For instance, the #1 seed is 1, that player’s opponent is 2, and so on.)
  • Player
  • Country
  • Seed — the seeding assigned by the US Open
  • Rank [ATP/WTA] — the player’s official ranking on the Monday that the tourney began.
  • Ordinal — the player’s rank among the 128 players in the field.  Last year, Shelby Rogers’s WTA ranking was 344, which made her ordinal ranking 124 out of 128.
  • 1stRdOpp — the ordinal ranking of the player’s first-round opponent.
  • Best2nd — the ordinal ranking of the player’s best possible second-round opponent.
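
The last three columns can be derived from the draw position and ranking alone. Here is a minimal Python sketch of one way to do it, not necessarily how the files were built. It assumes a standard bracket in which positions pair up as (1, 2), (3, 4), and so on, and in which the winner of each pair meets the winner of the neighboring pair in the second round. The dictionary keys and the four sample players are hypothetical, and unranked players and tied rankings are not handled specially.

```python
def add_draw_context(draw):
    """Add 'ordinal', 'first_rd_opp', and 'best_2nd' to each player dict.

    draw: list of dicts with 'pos' (draw position, 1..N) and 'rank'
    (official ranking, lower is better).
    """
    # Ordinal: position in the field when sorted by official ranking.
    for i, p in enumerate(sorted(draw, key=lambda p: p["rank"]), start=1):
        p["ordinal"] = i

    by_pos = {p["pos"]: p for p in draw}
    for p in draw:
        pos = p["pos"]
        # First-round opponent: the other half of the (odd, even) pair.
        opp_pos = pos + 1 if pos % 2 == 1 else pos - 1
        p["first_rd_opp"] = by_pos[opp_pos]["ordinal"]
        # Possible second-round opponents: the other pair in the same
        # block of four draw positions.
        block_start = ((pos - 1) // 4) * 4 + 1
        others = [q for q in range(block_start, block_start + 4)
                  if q not in (pos, opp_pos)]
        p["best_2nd"] = min(by_pos[q]["ordinal"] for q in others)
    return draw


# Hypothetical four-player section of a draw.
draw = add_draw_context([
    {"pos": 1, "rank": 1},
    {"pos": 2, "rank": 344},
    {"pos": 3, "rank": 50},
    {"pos": 4, "rank": 20},
])
for p in draw:
    print(p)
```
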
Let me know if you find anything interesting!

Filed under Data, Research, U.S. Open