Getting involved with park factors can be dangerous. There is a lot of confusion/disagreement out there about the purpose
of park factors, how they should be calculated, how they should be applied, what they really mean, etc. Hopefully this short
word on them will clarify my views and give some sense of the philosophy behind the park factors presented here.
Why Do We Need Park Factors?
Park factors are needed because the thirty major league ballparks all affect the game of baseball in thirty different ways.
Sometimes these effects are drastic (Coors Field or Dodger Stadium) and sometimes they're almost negligible. But if you want
to fairly value ballplayers, you need to take context into account. PFs are one way to do that.
An argument against PFs, usually made by sabermetric opponents, is something along the lines of "Life is not fair; some players
have the fortune to play in a park conducive to their talents, others don't. Deal with it." I believe that there is a lot
of truth in this argument; for instance, I don't seek to right all the perceived injustices of the world in doing sabermetrics (like,
"what if Edgar Martinez played before the DH" or, on the flip side of the same debate, "what if Edgar Martinez wasn't wasted
by Seattle for three or four years when he was a good player"). But the argument ignores the cardinal rule of sabermetrics:
that it is wins and losses (or as building blocks, runs and outs) that matter. A Dante Bichette who has the good fortune to
play at Coors Field will put up gaudy numbers--but in that context, they do not translate to wins the way they would elsewhere. Park Factors are
a step along the way of determining the value (or the theoretical value under other circumstances) of player performance.
Martinez may have been helped by the DH rule--but it was a reality and using it, he produced real wins for the Mariners.
Bichette's numbers were helped by Coors Field--but that inflation did not help the Rockies win baseball games.
How Should Park Factors be Constructed and Calculated?
To answer this question, you first must ask yourself the all-important sabermetric question, "What are you trying to measure?" (See
the "Ability v. Value" article on this site for more details on the terms I will be throwing around.) If you want to measure
ability, then certain assumptions and methods may be appropriate--but if you are seeking a value measurement, then those same
assumptions can be woefully inappropriate. I will discuss a number of ways to do PFs and when they should be used:
RUNS OR COMPONENTS?
Most published PFs are run park factors (i.e. they measure the park's effect on run scoring). This could be for a variety
of reasons: it is easier to use one PF than six or seven, the data is more available (especially pre-Retrosheet), or that is
the appropriate choice for the question at hand. However, other people publish event-specific park factors, such as the Wrigley
Field park factor for doubles.
Component park factors are, at least in my opinion, possibly appropriate for performance measures and absolutely appropriate
for ability measures. A park may increase run scoring by 10%, but it does not affect all offensive events by the same factor.
So, if you want to know what a player would do in a truly neutral context, you need to adjust his singles, doubles, walks,
etc. separately.
But if you are measuring value, that is not at all what you want to do. A player's actual value is almost solely a function
of his runs and outs. The park factor allows us to convert the runs and outs to wins, but that is all it should do. In
a value method, the park factor's only purpose is to state the true value of the player's runs. I am struggling a bit with
how to put this in writing clearly, but it is clear in my mind.
A related question is whether or not to use separate factors for left-handed and right-handed batters. The answer: sure,
you can do it if you want ability, but if you want value, absolutely not. The reasoning here is exactly the same as the reasoning
for runs v. components.
ADJUSTMENT FOR TEAM
The title is vague, but that's because I'm tying two ideas under it. The first is Bill James's work in the Historical Baseball
Abstract. Instead of using park factors and adjusting the player's performance (or the other option, adjusting the league
so that they played in the same park as the player), he simply eschewed the league altogether and evaluated the player's performance
in the context of the runs scored and allowed by his team. This is an implicit park factor--the RPG in Rockies games is higher
than the RPG in Padres games. James also indicated that he liked it better, saying that the player ultimately performs in
that context. The real value to his team's winning of a Dodger playing in San Francisco does not depend on what a Brave is doing
in Chicago. This approach is inherently for a value measure, and personally, I feel that it has a lot of merit.
Then there is what has traditionally been done in the Pete Palmer PFs which have appeared in The Hidden Game of Baseball
and Total Baseball. In addition to quantifying the park's effect on run scoring, Palmer publishes separate PFs for
hitters and pitchers. In the BPF (Batter's Park Factor) for instance, there is an adjustment made for the fact that the
team's batters did not face their own team's pitchers. In order to facilitate a truly equal comparison with the other batters in the league,
the quality of pitchers faced should be equal.
This ties into the James approach, because my take on it (at least value-wise) is, "Why does it matter?" Sure, it's easier
if you don't have to face your teammates who lead the league in ERA. But the wins and losses you have produced were done
solely against your competitor on the field. If the Yankees have good offense and pitching, they will benefit from not having
it square off--and they will win real games as a consequence. Value must be grounded in real wins and losses.
But if you are going for ability, the idea of adjusting for quality of opponents is a good one. However, I think it is best
left as a separate adjustment, because lumping it in with the park factor obscures which part of the adjustment is for the
park and which part is for the not-facing-teammates factor. But it's Pete's book and if that's how he wants to do it, more
power to him. This is a style issue, not a methodology one.
IN AN IDEAL WORLD
Here I will start to discuss the choices I have made in calculating the PFs published here. I wanted to save this until the
end, but I need to clarify my terms before writing the next section. Usually, we take a PF and apply it to a player's complete
batting line. But what if the player has played more on the road than at home? Isn't he unfairly docked by applying a PF
that assumes equal play at home and on the road to his road-heavy stats? Yes. And it's not just "unfair", it also goes against
value logic. If the performance actually occurred on the road, it should be evaluated in that context.
The Big Bad Baseball Annual followed this approach in their later editions. They calculated home and road park factors
for each team (the road factors for each team are not 1.00 as they are for the league because the team does not play on the
road in its home park, while all the other teams do. They should be around 1, though) and adjusted the player's home and road
stats separately, then added them back up. This is a great approach, if you have the data, time, and sanity. Heck, if you
have the player's stats by each specific park rather than just home/road, go for it. But you can see that this would get
very complicated very quickly.
And when you actually calculate the PF, most people will just weigh the RPG at home against the RPG on the road. The
more advanced approaches will, instead of comparing home context to road context, compare home context to league context.
If there are T teams in a league, the aggregate league context is based 1/T on each park. For a given team, the league
context is therefore ((T-1)*Road+Home)/T, i.e. (T-1)/T road and 1/T home. In a league with 14 teams, 13/14 of the league
context is based on a given team's "road" parks and 1/14 on its home park.
But then, don't teams play unbalanced schedules? If the Orioles play 10% of their road games at Yankee Stadium and just 4%
at Safeco Field, shouldn't their road context be based 10% on Yankee Stadium and 4% on Safeco rather than 1/13 (about 8%) on both?
Sure. But it's just more complexity with a limited effect on the ultimate answers the sabermetric methods will yield.
Another problem with these kinds of adjustments (although more so left/right adjustments) is the sample size. And so the question
also arises, how many years of data should you use? Some people use just one year, others three, others five. There are arguments,
and good ones, to be made for all viewpoints. Pros for one year include taking into account conditions that change from year
to year such as the weather. Also, while the "road" context used in figuring the PF is assumed to be constant, it actually
varies based on the other parks in the league. When Coors Field opened, teams scored more runs on the road than the previous
year because they were now playing 8% of their road games at a launching pad. A one year PF keeps the other parks constant.
On the other hand, multiple year PFs give you more data and an increased sample size. Personally, I side with multiple year.
With regression thrown in.
Any time we observe something, it is just a sample and not the true frequency or proportion the event should occur at. By
taking our observed values and regressing them towards the mean, we can increase the reliability and lessen the chance that
we "overadjust". On the other hand, sometimes we might dampen a real effect that was not due to chance. But on
the whole, we will be better off. The degree of regression that is appropriate is a totally different subject and one that
deserves more study and real statisticians (unlike myself) working on it.
And then there is the problem of how to express the PF when we are done. Let's say that by whatever approach, we determine
a home context RPG of 11 and a road context RPG of 10. A quick PF would be 11/10=1.10. The park increased offense by 10%.
But that is only for performances that occurred in the park. The road numbers should not be adjusted at all. So we can't
just adjust a player's RC down by 10% (unless it is only his home RC). To apply it to the whole stat line, we need to water
it down by 50% (since 50% of the games are played at home). So to apply it, we need (1.10+1)/2=1.05, or equivalently, (1.10-1)/2+1=1.05.
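As a rough sketch of that watering down, the arithmetic can be written out as below. The function name and the `home_share` parameter are my own illustrative additions, not part of any published method; the default of 0.5 is just the equal home/road split assumed in the text.

```python
def composite_pf(home_rpg: float, road_rpg: float, home_share: float = 0.5) -> float:
    """Scale a home-only park factor so it can be applied to a full stat line.

    home_share is a hypothetical parameter: the fraction of the player's
    games played at home (0.5 reproduces the (PF + 1) / 2 shortcut).
    """
    home_only_pf = home_rpg / road_rpg
    # Road performance is left unadjusted (factor of 1.0); only the
    # home portion of the stat line carries the park effect.
    return home_share * home_only_pf + (1.0 - home_share) * 1.0


# The example from the text: home context RPG of 11, road context RPG of 10.
print(round(composite_pf(11.0, 10.0), 3))  # 1.05
```

Setting `home_share` to something other than 0.5 is one simple way to handle the road-heavy player discussed earlier, though that is an extension of mine rather than something applied in the factors here.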
WHAT SHOULD THE DENOMINATOR BE?
In this section, I will focus solely on run park factors, since that is what I use and I have not really given deep thought
to proper denominators for other events. Some people do park factors for singles and other hits in play with a denominator
of balls in play. Others do them with plate appearances. Voros McCracken uses all sorts of different denominators. I'm
not going there, right now.
So, for a run PF, the denominator should be...outs. Or innings or games, which are essentially equivalent to outs. Why outs?
Well, let's take a player who creates 100 runs and makes 350 outs in 600 plate appearances, but plays in a park with a 1.05
PF (increases offense by 5%). Let's say that the 1.05 is based on using R/PA as the rate stat for the teams in calculating
the PF. What a 1.05 PF means, then, is that his R/PA has been inflated by 5%. So his 100/600=.167 R/PA should be adjusted
to .167/1.05=.159. But then we are going to use R/O in the valuation method (at least in my stats here). So now we have adjusted
the runs, but the out rate is also affected by the park (the OBA at Coors is higher than the OBA at Dodger Stadium). So now
we need an outs park factor too. We need to change the number of outs he's made since that effect was not considered in the
park factor.
But wait a second, aren't outs constant? If you play at Coors Field, you still have 27 outs in a game and you use them all.
Outs are constant across parks. At first, this seems like it is really going to complicate things, but luckily, the math
works out well. Say the 1.05 was actually a PF based on team R/G (which is essentially equivalent to R/O). The 1.05 PF based
on R/G means the player's R/O needs to be divided by 1.05. So it goes from 100/350=.286 to .286/1.05=.272. Now, how many
runs has he created? .272 runs/out*350 outs=95.24 runs. And he still has 350 outs. But 95.24 is also equal to 100/1.05.
In other words, you can apply the R/O PF directly to a player's RC total and still get the correct rate stat answer.
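A quick check of that equivalence, using the numbers from the example above. The variable names are only illustrative.

```python
runs_created = 100.0
outs = 350
pf = 1.05  # assumed to be a run PF based on R/G (equivalently R/O)

# Route 1: adjust the rate stat, then multiply back by the unchanged outs.
adjusted_rate = (runs_created / outs) / pf
runs_via_rate = adjusted_rate * outs

# Route 2: divide the runs created total by the PF directly.
runs_via_total = runs_created / pf

print(round(adjusted_rate, 3))   # 0.272
print(round(runs_via_rate, 2))   # 95.24
print(round(runs_via_total, 2))  # 95.24
```

Because outs are the same before and after the adjustment, they cancel, which is why both routes give the same runs total.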
How I Calculate PF
I lifted the method here essentially from Craig Wright in The Diamond Appraised, with regression factors published
by MGL thrown in. It is fairly straightforward and most of the concepts involved have been touched on above.
First, I use up to five years of data (so this year, it is 2000-2004). If the park has not been used for all five years, or
has had major renovations, I use the appropriate smaller time length. One departure is the Montreal/San Juan Expos. Although
the Expos have played at Olympic Stadium for many years, playing twenty games or whatever in Puerto Rico has really thrown
a monkey wrench into things. One approach (and perhaps the best) would be to use five years of data for Olympic Stadium and the
two years of data for San Juan, and weight them by the percentage of games played at each venue. However, I have chosen to just
look at those two seasons on their own, with the two stadiums combined (I'm pretty sure they played the same number of games in
San Juan in 03 and 04 so we don't have to worry about the percentages for each season being different. Next year we won't
have to worry about this, but we will have to worry about a one year park factor in Washington).
I have applied regression as well, but we'll touch on that last. I have taken RPG at home and RPG on the road. Call these
H and R respectively. Then, the raw PF is H*T/((T-1)*R+H). (NOTE 11/2005: This formula previously lacked the "*T" in the
numerator before a reader pointed out the error.) All this does is make the "road context" 13/14 of the opposing stadiums and
1/14 of your stadium, as it is for the league (assuming 14 teams here; T is # of teams in the league). Then, take the raw
PF, add 1 to it, and divide by 2. This just makes it applicable to composite stats rather than just home stats. Call this
intermediate PF iPF.
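Here is a minimal sketch of the raw PF and iPF steps just described. The function names are mine, and the 11 and 10 RPG inputs are borrowed from the earlier illustration rather than from any real park.

```python
def raw_pf(home_rpg: float, road_rpg: float, teams: int) -> float:
    """Raw PF = H*T / ((T-1)*R + H): home context over league context."""
    return home_rpg * teams / ((teams - 1) * road_rpg + home_rpg)


def intermediate_pf(home_rpg: float, road_rpg: float, teams: int) -> float:
    """iPF = (raw PF + 1) / 2, so it applies to combined home/road stats."""
    return (raw_pf(home_rpg, road_rpg, teams) + 1.0) / 2.0


# Illustrative inputs: H = 11 RPG, R = 10 RPG, 14-team league.
print(round(raw_pf(11.0, 10.0, 14), 4))           # 1.0922
print(round(intermediate_pf(11.0, 10.0, 14), 4))  # 1.0461
```

Note that the raw PF (about 1.092) comes out a little lower than the simple 11/10=1.10, because the home park is itself 1/14 of the league context it is being compared against.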
Then we apply regression. I have used weights suggested by MGL in a post on baseballboards.com in 2000. These weights seem
arbitrary, but maybe he had a good study to base them on. Anyway, I think they're reasonable and that's why I use them. The
Final PF is:
1-(1-iPF)*X
Where X=.6 for 1 year, .7 for 2 years, .8 for 3 years, and .9 for 4+ years.
To give you an idea of the effect of this, say there's a park with an observed iPF of 1.25 (this is a Coors-type effect). If
that was observed over 1 year, we'd use a PF of 1.15. For 2 years, 1.175, for 3 years, 1.20, and for 4 years, 1.225.
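The regression step can be sketched in a few lines. The weights are the ones listed above; the function and constant names are just for illustration.

```python
# Regression weights by years of data, as listed above; 4+ years uses 0.9.
REGRESSION_WEIGHTS = {1: 0.6, 2: 0.7, 3: 0.8}


def final_pf(ipf: float, years: int) -> float:
    """Regress the intermediate PF toward 1.00: 1 - (1 - iPF) * X."""
    if years < 1:
        raise ValueError("need at least one year of data")
    x = REGRESSION_WEIGHTS.get(years, 0.9)
    return 1.0 - (1.0 - ipf) * x


# The Coors-type example: an observed iPF of 1.25.
for years in (1, 2, 3, 4):
    print(years, round(final_pf(1.25, years), 3))
# 1 1.15
# 2 1.175
# 3 1.2
# 4 1.225
```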
UPDATE: 10/05...For the 2005 park factors, I am adding a home run park factor. It is calculated in exactly the same way as
the run park factor. This means that it is based on HR/G, which as I explained above, is roughly proportional to HR/Out.
While R/Out is a proper measure, HR/Out is not. HR/PA would be a much better measure of HR ability. But when you consider
how this specific park factor is intended to be applied, it is not problematic. First, it is based on the team data, so the
kind of problems you would have with HR/Out applied to Barry Bonds do not exist. But more importantly, it is more useful
if you want to figure out how many homers a player would hit in a different context. If a park reduces outs and increases
homers (hello, Coors), the HR PF may overstate the increase in home runs caused by the park. But a player who moves to Coors
will still hit more homers both due to the true HR factor AND the fact that fewer outs for the team equals more PAs for him.
So I think that the PF based on HR/G is actually more useful for a fantasy player. Also, the differences between the HR/G
and HR/PA PFs would be very small--most parks are close to 1.00 of course anyway.
I have also added a spreadsheet with five year PFs for all teams, 1901-2004. The guiding philosophy was to try to include
as much data as possible. If there are five possible years of data to be used for a park, they will all be used, even if four
of the seasons were in the past or in the future. The source of the raw data was KJOK's excellent park database.
I treat a park as new if there are major changes to the dimensions, but I did not by any means do a complete historical survey
to find out when those changes have taken place, so some that probably should have been treated differently are not. If you
have specific data on when a change should have (or shouldn't have) been made, feel free to leave a comment and I will
try to incorporate these changes when I update the chart some time in the future.
Additionally, when a team moves, and a new team immediately moves in (for example, the Senators of '60 and '61),
this is treated as a new team. Also, in cases in which teams have played a significant (which I defined as around ten or more)
number of games in a different stadium in the same year, those years are treated as being a new park (an example is the Dodgers
playing games in New Jersey the two years before they moved from Brooklyn). Whenever a "new park" of this sort
is established, the original park is treated as another new park once the old arrangement resumes.
The reason the park factors are only shown through 2004 is that my ideal data set is two previous years, the year in question,
and two future years. For most of the parks active in 2005, we will after 2007 be able to fill this dataset, and so I don't
want to publish a park factor now and change it later. However, there are a few parks where the 2004 or 2003 factors are not
yet settled because they are new and there are not yet five years of data available. In these cases, I have listed a PF but
marked it as one that will change in the future.
Now I will give an example of how I choose the years to be considered in figuring the PF. Suppose we look at the Diamondbacks,
who have played in Bank One Ballpark since 1998. In 1998, we have no previous data, but we do have four future years of data
we can use, so the sample is 1998-2002. For 1999, we have one previous year, so we take three future years, and get 1998-2002.
For 2000, we have two previous years and two future years, so we use 1998-2002. In 2001, we use the two previous years (1999
and 2000), and two future years (2002 and 2003), making the total sample 1999-2003.
Let's also consider the end of the Braves' tenure in Fulton County Stadium. The last season there was 1996. For
1994, we have two previous years (92 and 93) as well as two future years (95 and 96), so we use 1992-1996. For 1995, we have
just one future year, so we use three previous years, and also use 1992-1996, and the same for 1996.
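The year-selection rule in these two examples can be sketched as a small function. This is my own reading of the rule (center the five-year window on the season, then slide it to stay inside the park's tenure); the function name and signature are hypothetical. The 2004 end year for Bank One Ballpark and the 1966 start year for the Braves in Atlanta are only there to bound the examples.

```python
def pf_window(year: int, first_year: int, last_year: int, span: int = 5) -> tuple[int, int]:
    """Return the (start, end) seasons used for `year`'s park factor.

    first_year and last_year bound the seasons available for the park.
    """
    if not first_year <= year <= last_year:
        raise ValueError("year is outside the park's available seasons")
    half = span // 2
    start, end = year - half, year + half
    if start < first_year:
        # Not enough previous years: borrow extra future years instead.
        end += first_year - start
        start = first_year
    if end > last_year:
        # Not enough future years: borrow extra previous years instead.
        start -= end - last_year
        end = last_year
    # A park with fewer than `span` seasons simply uses all of them.
    return max(start, first_year), end


# Diamondbacks at Bank One Ballpark (data assumed available through 2004).
print(pf_window(1998, 1998, 2004))  # (1998, 2002)
print(pf_window(2001, 1998, 2004))  # (1999, 2003)

# End of the Braves' tenure in Atlanta (last season 1996).
print(pf_window(1995, 1966, 1996))  # (1992, 1996)
```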
2003 Park Factors
2004 Park Factors
2005 Park Factors
2006 Park Factors
2007 Park Factors
1901-2004 Park Factors