Base Runs

The link below is for an article that was printed in the August 2001 edition of By the Numbers, the newsletter of SABR's Statistical Analysis Committee, with minor modifications, about David Smyth's outstanding Base Runs method. Below the link is some additional information about BsR.

Base Runs: A Promising New Run Estimator

Breaking Down BsR

It is sometimes useful to write a stat like Base Runs in rate form. It helps greatly in building the Theoretical Team equations, for one thing, and it is also useful to be able to write BsR completely in terms of BA, OBA, SLG, and HR/PA. To do this, start with each component and divide it by PA, giving A/PA, B/PA, C/PA, and D/PA. (Since I am using a basic version of Base Runs, PA = AB+W.) You can call these, respectively, Runners On Base Average (ROBA), Advancement Factor (AF), 1-OBA, and HR/PA. Then

BsR/PA = ROBA*AF/(AF+1-OBA)+HR/PA

For the Basic version I use, these are the equations for each component:

ROBA = (H+W-HR)/(AB+W) = OBA-HR/PA

AF = (2*TB-H-4*HR+.05*W)*.78/(AB+W) = ((2*SLG-BA)*(1-OBA)/(1-BA)-4*HR/PA+.05*(OBA-BA)/(1-BA))*.78

1-OBA = 1-(H+W)/(AB+W)

HR/PA = HR/(AB+W)
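
As a quick check on the rate form, here is a minimal Python sketch, assuming the basic version with PA = AB+W and the .78 multiplier. The function names and the sample stat line are made up for illustration. The rate-stat version, multiplied back by PA, should reproduce the counting-stat version.

```python
def bsr_counting(ab, h, tb, hr, w, b_mult=0.78):
    """Basic BsR from counting stats: A*B/(B+C) + D."""
    a = h + w - hr
    b = (2 * tb - h - 4 * hr + 0.05 * w) * b_mult
    c = ab - h
    return a * b / (b + c) + hr


def bsr_per_pa(ba, oba, slg, hr_pa, b_mult=0.78):
    """Basic BsR per PA from BA, OBA, SLG, and HR/PA (PA = AB + W)."""
    ab_pa = (1 - oba) / (1 - ba)   # AB/PA
    w_pa = (oba - ba) / (1 - ba)   # W/PA
    roba = oba - hr_pa
    af = ((2 * slg - ba) * ab_pa - 4 * hr_pa + 0.05 * w_pa) * b_mult
    return roba * af / (af + 1 - oba) + hr_pa


# Hypothetical stat line, for illustration only
ab, h, tb, hr, w = 550, 150, 240, 20, 50
pa = ab + w
from_counts = bsr_counting(ab, h, tb, hr, w)
from_rates = pa * bsr_per_pa(h / ab, (h + w) / pa, tb / ab, hr / pa)
print(round(from_counts, 2), round(from_rates, 2))
```

The two printed numbers should agree, since the rate form is just the counting form divided through by PA.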

In the Base Runs article linked above, I gave the equations that I use for each factor in this basic version. The B multiplier is based on the composite MLB stats of 1946-1995. For this period, the averages for each component are:

ROBA   AF     OBA    HR/PA
.303   .308   .325   .0222

You can use these to put together the Theoretical Team factors. The TT concept, which I will not explain here in every detail, is that since Base Runs (or Runs Created) is a run estimator devised for estimating team runs, there is an interaction between the values of the offensive events. As offensive production increases, the value of each event goes up (with the exception of the special case, the HR). So applying BsR directly to Babe Ruth gives him an unfair advantage, because he is not playing on a team by himself; he is playing on a team with 8 other players. The TT formula therefore puts the player on a team with 8 average players, and we assume that each player on the theoretical team gets the same number of PA as our player. The team's new A factor can then be calculated as (A+LgROBA*PA*8), where A is the individual's A factor. You apply the same technique to the B, C, and D terms, using the long-term averages above (you really should have a separate version for each year, but small changes in ROBA, AF, etc. don't significantly change the results of the formula).

Then, to see how much the player has helped this team, we compare him to a team of 8 average players in his number of PA each. If we wanted to compare the player to the league average, we would compare him to 9 average players. If you work all this out and simplify, you get this equation for TT BsR, which I like to call Individual Base Runs(IBR).

IBR = (A+2.42PA)(B+2.46PA)/(B+C+7.86PA)+HR-.76PA
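
The construction can be sketched in Python without the rounding, assuming the long-term averages quoted above. The function names are mine, and the inputs are the player's own A, B, C, and D factors plus his PA. A handy sanity check is that an exactly average player should come out with exactly the league BsR/PA times his PA.

```python
# Long-term (1946-1995) averages quoted above
LG_ROBA, LG_AF, LG_OBA, LG_HR_PA = 0.303, 0.308, 0.325, 0.0222


def lg_bsr_per_pa():
    return LG_ROBA * LG_AF / (LG_AF + 1 - LG_OBA) + LG_HR_PA


def ibr(a, b, c, d, pa, mates=8):
    """Runs the player adds to a lineup of `mates` average hitters."""
    team_a = a + LG_ROBA * pa * mates
    team_b = b + LG_AF * pa * mates
    team_c = c + (1 - LG_OBA) * pa * mates
    team_d = d + LG_HR_PA * pa * mates
    team_runs = team_a * team_b / (team_b + team_c) + team_d
    # Subtract what the average teammates would score on their own
    return team_runs - mates * lg_bsr_per_pa() * pa


def ibr_rounded(a, b, c, hr, pa):
    """The rounded formula given in the text."""
    return (a + 2.42 * pa) * (b + 2.46 * pa) / (b + c + 7.86 * pa) + hr - 0.76 * pa


pa = 600
avg = ibr(LG_ROBA * pa, LG_AF * pa, (1 - LG_OBA) * pa, LG_HR_PA * pa, pa)
print(round(avg, 2), round(lg_bsr_per_pa() * pa, 2))
```

Because the constants in the published formula are rounded to two decimals, `ibr_rounded` can differ from `ibr` by a run or two over a full season.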

Lest it seem as if I am taking credit for coming up with all of this, the pioneering TT work was done by Dave Tate and Bill James, and the application of the TT concept to BsR was also the work of David Smyth.

Stolen Base BsR

It is useful and necessary to get some more categories into a Runs Created formula, so here we'll put SB and CS in (this is again based on Smyth's work). The other categories we could add, like SF, SH, and DP, I choose to ignore. For one, they are very situation-dependent, so I'm not 100% comfortable including them in an individual formula; secondly, and more importantly, I am lazy and don't want to deal with them. Anyway, for BsR including SB:

A = H + W - HR - CS

B = (2*TB - H - 4*HR + .05*W +1.5*SB)*.76

C = AB-H

The IBR formula for the standard league is:

IBR =(A+2.34PA)(B+2.58PA)/(B+C+7.98PA)+HR-.76PA

With SB and CS in the formula, A/PA and B/PA are no longer the same rate stats as ROBA and AF, so I call them AROBA and AAF, for "advanced". The long-term averages are:

AROBA   AAF    OBA    HR/PA
.293    .323   .325   .0222

Full BsR

Here is a version of the BsR formula that you can use if you have all of the minor (SH, SF, DP, etc.) offensive stats. It is not as clean and nice looking as the other versions on this page, but there needs to be more of a give-and-take between the various events when you include the other stats. It is also not straightforward as to which events should be placed in which factor(s). I took the convention that A is final baserunners: baserunners less those who we know have been thrown out on the bases or taken out on a DP. Everything goes in B to balance everything out and produce good linear weights, while C is batting outs. D remains home runs. There are other ways to define these terms, and Smyth, TangoTiger, and Robert Dudek have all done these in different ways than I have. There are certainly arguments to be made for all of the different approaches, but a discussion of that will have to wait for another day.
A = H + W + HB - HR - CS - DP
B = .777S + 2.61D + 4.29T + 2.43HR + .03(W + HB - IW) - .747IW + 1.30SB + .13CS + 1.08SH + 1.81SF + .70DP -.04(AB-H)
C = AB - H + SH + SF

If you want to include strikeouts, use this B factor together with the A and C factors given above:

B = .781S + 2.61D + 4.28T + 2.42HR + .034(W + HB - IW) - .741IW + 1.29SB + .125CS + 1.07SH + 1.81SF + .69DP - .029(AB-H-K) - .086K
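
Here is a rough Python sketch of the full version without the strikeout term, using the A, B, and C factors given above with D = HR. The dictionary keys and the team stat line are hypothetical, and I am assuming that W includes intentional walks, which is how the (W + HB - IW) term reads.

```python
def full_bsr(s):
    """Full BsR (no strikeout term) from a dict of counting stats."""
    singles = s["H"] - s["D"] - s["T"] - s["HR"]
    outs = s["AB"] - s["H"]
    a = s["H"] + s["W"] + s["HB"] - s["HR"] - s["CS"] - s["DP"]
    b = (0.777 * singles + 2.61 * s["D"] + 4.29 * s["T"] + 2.43 * s["HR"]
         + 0.03 * (s["W"] + s["HB"] - s["IW"]) - 0.747 * s["IW"]
         + 1.30 * s["SB"] + 0.13 * s["CS"] + 1.08 * s["SH"]
         + 1.81 * s["SF"] + 0.70 * s["DP"] - 0.04 * outs)
    c = outs + s["SH"] + s["SF"]
    d = s["HR"]
    return a * b / (b + c) + d


# Hypothetical team season, for illustration only
team = {"AB": 5500, "H": 1450, "D": 270, "T": 30, "HR": 150, "W": 520,
        "HB": 40, "IW": 40, "SB": 100, "CS": 50, "SH": 60, "SF": 45, "DP": 125}
runs = full_bsr(team)

# Value of one extra home run: one more AB, H, and HR
plus_hr = dict(team, AB=team["AB"] + 1, H=team["H"] + 1, HR=team["HR"] + 1)
hr_value = full_bsr(plus_hr) - runs
print(round(runs), round(hr_value, 2))
```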

Finding the B Multiplier

The B multiplier is chosen so that the BsR formula produces the correct number of runs for the entity you are working with. B is the factor that gets fitted because the others are straightforward and obvious: A is baserunners, C is outs, and D is home runs.

You can calculate, based on A, C, and D, the actual B factor required to equate BsR with R, by this formula: (R-D)*C/(A-R+D). What can you do with the actual B value? For one thing, if you already have a set formula for B(ignoring the multiplier), you can divide actual B by estimated B to get the correct multiplier. Another thing you can do is run a regression to find weights for TB, H, etc. by using those stats to predict Actual B, or use other approaches like trial and error, etc. All of these approaches had a role in finding the B component used in the official versions of BsR.

An alternate way to find B is to calculate Z=(R-D)/A, which is the actual score rate, and then B=Z*C/(1-Z). It takes an extra step, but it is equivalent. (I include it because it was the way I did it until I took the time to work out the algebra to derive the other formula.)
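
Both routes to the actual B, and the multiplier that falls out of it, can be sketched in a few lines of Python. The team totals below are invented for illustration, and the unscaled B is the basic 2*TB - H - 4*HR + .05*W.

```python
def actual_b(r, a, c, d):
    """B required for A*B/(B+C) + D to equal actual runs R."""
    return (r - d) * c / (a - r + d)


def actual_b_via_z(r, a, c, d):
    """Same thing by way of the score rate Z = (R-D)/A."""
    z = (r - d) / a
    return z * c / (1 - z)


# Hypothetical team totals, for illustration only
ab, h, tb, hr, w, runs = 5500, 1450, 2230, 150, 520, 720
a, c, d = h + w - hr, ab - h, hr

act_b = actual_b(runs, a, c, d)
raw_b = 2 * tb - h - 4 * hr + 0.05 * w   # B before the multiplier
multiplier = act_b / raw_b

# Plugging the actual B back in should reproduce the runs scored
check = a * act_b / (act_b + c) + d
print(round(multiplier, 3), round(check, 1))
```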

Building the TT BsR Formula

Here are the technical steps for building the TT formula. These are not very interesting for most people, but hard core sabermetricians may find them useful (although hard core sabermetricians probably already know how to do it themselves):

IBR can be written as:

(A+X*PA)(B+Y*PA)/((B+Y*PA)+(C+Z*PA))+HR+T(PA)-(V)PA which simplifies to:

(A+X*PA)(B+Y*PA)/(B+C+(Y+Z)PA)+HR-(V-T)PA

where X is the remainder of team ROBA

Y is the remainder of team AF

Z is the remainder of team 1-OBA

T is the remainder of team HRPA

V is the R/PA for the comparison lineup multiplied by the number of players in the comparison lineup

OK, since we always add the player to a team with 8 average players:

X = LgROBA*8

Y = LgAF*8

Z = (1-LgOBA)*8

T = LgHR/PA*8

Depending on what baseline we use though, V will vary. For absolute runs, we compare the player to a team with 8 average hitters. For runs above average, we compare the player to a team with 9 average hitters. For runs above replacement, we compare the player to a team with 8 average hitters plus one replacement level hitter. So, it is very straightforward to find V for absolute: 8*LgBsR/PA. For average, V = 9*LgBsR/PA.

For replacement, we need to first set a replacement level, and then determine what ROBA, AF, OBA, and HRPA a replacement player will have. I assume 25 batting outs (AB-H) per game, and use BsR/PA to calculate the R/G for the league: (BsR/PA)/(1-OBA)*25, since BsR/O = (BsR/PA)/(1-OBA). (Keep in mind that BsR/PA = ROBA*AF/(AF+1-OBA) + HRPA.) Then, I assume the replacement rate is 1 run/game below average, so I take that R/G, subtract 1, and divide by 25. This is the replacement player's R/O. In the standard league we are using, BsR/PA = .117, R/O = .173, and RepR/O (R/O for the replacement) = .133. Then we need to find the value, X, by which each component stat for the league (ROBA, AF, OBA, and HRPA) needs to be deflated for R/O to equal .133. (This X is a deflator, not the X factor in the TT formula above.) We multiply each component in the BsR/O formula by X. This, when simplified, gives this equation:

(RAX^2/(1+X(A-O))+HX)/(1-OX) = RepR/O

R is LgROBA, A is LgAF, O is LgOBA, and H is LgHRPA. I have no idea how to solve for X by hand, but my TI-83 calculator will do it, and it gives .89 for the standard league (this will vary based on the league offensive level, and of course on how you personally choose to define the replacement rate). Anyway, we then multiply each component by .89 to find what we expect our replacement to hit:

ROBA   AF     OBA    HRPA
.269   .274   .289   .02

So this gives him a BsR/PA of .095. We then calculate the V value for the replacement baseline as 8*LgBsR/PA+RepBsR/PA. Here is a chart showing the values you need to fill in for the TT components at each baseline in the standard league:

BASELINE      X      Y      Z      T      V       V-T
Absolute      2.42   2.46   5.40   .178   .937    .759
Average       2.42   2.46   5.40   .178   1.054   .876
Replacement   2.42   2.46   5.40   .178   1.031   .853
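
If you don't have a TI-83 handy, the chart can be rebuilt with a short Python sketch. It assumes the long-term averages and the replacement definition described above (25 outs per game, one run per game below average), and it finds the deflator by simple bisection rather than algebra. The names are my own, and `x_rep` is the deflator, not the X column of the chart.

```python
LG = {"roba": 0.303, "af": 0.308, "oba": 0.325, "hr_pa": 0.0222}


def bsr_per_pa(roba, af, oba, hr_pa):
    return roba * af / (af + 1 - oba) + hr_pa


def runs_per_out(deflator):
    """League BsR per batting out with every component multiplied by deflator."""
    r = {k: v * deflator for k, v in LG.items()}
    return bsr_per_pa(**r) / (1 - r["oba"])


def solve_deflator(target, lo=0.0, hi=1.0, iters=60):
    """Bisection; runs_per_out rises with the deflator on [0, 1]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if runs_per_out(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


lg_r_pa = bsr_per_pa(**LG)
rep_r_o = (runs_per_out(1.0) * 25 - 1) / 25   # one run per game below average
x_rep = solve_deflator(rep_r_o)
rep_r_pa = runs_per_out(x_rep) * (1 - LG["oba"] * x_rep)

X, Y = 8 * LG["roba"], 8 * LG["af"]
Z, T = 8 * (1 - LG["oba"]), 8 * LG["hr_pa"]
V = {
    "absolute": 8 * lg_r_pa,
    "average": 9 * lg_r_pa,
    "replacement": 8 * lg_r_pa + rep_r_pa,
}
for name, v in V.items():
    print(name, round(v, 3), round(v - T, 3))
```

Small differences in the third decimal place against the chart are to be expected, since the chart was built from rounded intermediate values.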

If you want to get more complex, there is something that we have failed to address: if you really add a player to a team, he will change the number of PA everyone in the lineup gets. A player with a higher OBA than his teammates will generate more PA; one with a lower OBA will generate fewer. In the TT formula above, we have held PA constant. What if we let them vary? We can calculate the OBA the team would have with the player as 8/9*LgOBA+1/9*OBA. Call this Q. Then, figure (1-LgOBA)/(1-Q). Call this PAR, for PA-added ratio. Then, multiply every individual term (the new A, the new B, the new C, and the new D) by PAR, and proceed as usual.
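
A hedged Python sketch of the PAR adjustment follows. It assumes the basic version (so the player's OBA is 1 - C/PA), the long-term averages from earlier, and the absolute baseline. I am reading "the new A, B, C, and D" as the team totals after the player has been added; the function names are mine.

```python
LG_ROBA, LG_AF, LG_OBA, LG_HR_PA = 0.303, 0.308, 0.325, 0.0222
LG_BSR_PA = LG_ROBA * LG_AF / (LG_AF + 1 - LG_OBA) + LG_HR_PA


def pa_added_ratio(oba):
    """PAR: how much the lineup's PA grow or shrink with this player in it."""
    q = 8 / 9 * LG_OBA + 1 / 9 * oba
    return (1 - LG_OBA) / (1 - q)


def ibr_variable_pa(a, b, c, d, pa):
    """Absolute TT runs with the team factors scaled by PAR."""
    oba = 1 - c / pa   # basic version: C = AB - H and PA = AB + W
    par = pa_added_ratio(oba)
    team_a = (a + 8 * LG_ROBA * pa) * par
    team_b = (b + 8 * LG_AF * pa) * par
    team_c = (c + 8 * (1 - LG_OBA) * pa) * par
    team_d = (d + 8 * LG_HR_PA * pa) * par
    team_runs = team_a * team_b / (team_b + team_c) + team_d
    return team_runs - 8 * LG_BSR_PA * pa


# Hypothetical player with a .333 OBA in 600 PA
print(round(pa_added_ratio(1 - 400 / 600), 4))
print(round(ibr_variable_pa(180, 196.95, 400, 20, 600), 1))
```

An exactly average player has a PAR of 1 and comes out unchanged, while an above-average OBA pushes the figure up a little.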

Is this worth it? Who knows. Some of these bells and whistles might wash out when you convert them to win values. Maybe they don't. A straight linear system, though, might be correct, and it will help you keep your sanity.

Simplified TT BsR

What does all that do for you? Believe it or not, it leaves you with almost the same result you would have gotten if you took 1/9 of the player's straight BsR and 8/9 of the player's linear BsR(based on the reference team). This is for the absolute version only, but of course if you have the absolute RC figure, you can calculate the values above other baselines without going through all the voodoo above.
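
To see how close the shortcut comes, here is a small Python comparison for a hypothetical player, assuming the long-term averages as the reference team. "Linear BsR" is taken to mean the player's A, B, C, and D valued at the reference team's marginal rates, which is the same as applying the reference team's linear weights to his events.

```python
LG_A, LG_B, LG_C, LG_D = 0.303, 0.308, 0.675, 0.0222   # per PA; C is 1 - OBA


def straight_bsr(a, b, c, d):
    return a * b / (b + c) + d


def linear_bsr(a, b, c, d):
    """Player's factors valued at the reference team's marginal rates."""
    bc = LG_B + LG_C
    return (a * LG_B / bc
            + b * LG_A * LG_C / bc ** 2
            - c * LG_A * LG_B / bc ** 2
            + d)


def simplified_tt(a, b, c, d):
    return straight_bsr(a, b, c, d) / 9 + 8 * linear_bsr(a, b, c, d) / 9


def full_tt(a, b, c, d, pa):
    """Absolute TT BsR: the player plus 8 average hitters, less those 8."""
    team = straight_bsr(a + 8 * LG_A * pa, b + 8 * LG_B * pa,
                        c + 8 * LG_C * pa, d + 8 * LG_D * pa)
    return team - 8 * straight_bsr(LG_A, LG_B, LG_C, LG_D) * pa


# Hypothetical player: A = 180, B = 196.95, C = 400, D = 20 in 600 PA
player = (180, 196.95, 400, 20)
print(round(simplified_tt(*player), 2), round(full_tt(*player, 600), 2))
```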

Fundamental Structure of BsR

The fundamental structure of BsR is its key asset. That fundamental structure is based on the simple, undeniable truth that runs scored = baserunners*% of baserunners who score + home runs. "Baserunners" does not include home runs. Anyway, in BsR, the A factor represents baserunners and the D factor represents home runs. The % of baserunners who score, which we'll call score rate, is estimated as B/(B+C), where B is advancement and C is outs.

Other run estimators are not backed up by a fundamental theory of how runs are scored. Runs Created's downfall is its failure to account for the unique nature of the HR (that it always produces at least one run, and if it occurs by itself, it will produce only one run). Static LW formulas fail to account for the fact that the value of each event varies based on the context. BsR is based on a true equation of how runs are scored. That does not mean, though, that BsR is the one true correct run estimator by any stretch. The equation of B/(B+C) to estimate score rate has good empirical accuracy, but also has been found to not work very well in some circumstances (such as OBA between .500 and .800; see Tango's article on Primer about this). Maybe score rate should be estimated in a totally different way. But the structure of the BsR equation is sound. If we want a better run estimator, we need a better estimator of score rate.

Linear BsR

You can figure how a non-linear RC formula values each event in the context you are interested in(it can be the league, a specific team, or even a hypothetical lineup of the same player over and over again). All you have to do is calculate BsR for the entity, and then add one single, recompute BsR, and subtract the first figure. This is the value of one additional single. Then you do the same with every other event, and you'll have LBsR. You have to be careful to account for everywhere the event is involved; for example, a single not only adds a hit but also a Total Base and an At Bat. If you run the LBsR for the long-term stats, you get these values:

LBsR = .48S+.81D+1.14T+1.50HR+.32W-.096(AB-H)

LBsR(sb) = .47S+.77D+1.07T+1.45HR+.33W+.23SB-.41CS-.093(AB-H)

Of course, you could add something other than one. You could subtract one, or add 10, or add 15. The further you get away from 0, the more the results will vary. Adding 1000 singles will have a much different effect, even per single, than adding 1 single. Really, as Tango has pointed out, we want to get as close to adding 0 singles as possible. Adding .00001 singles changes the run environment and the values of the other events very little, and that is what we are looking to do. It is sort of like a limit in calculus. Actually, I guess that's exactly what it is. We want to find the limit of (new BsR minus old BsR) divided by X, as X approaches 0, where X is the number of the event that we are adding. Somebody who knows a lot about calculus could probably tell me if I'm right about that, and if so, come up with a formula to calculate the limit precisely instead of having to do trial and error in a spreadsheet.

I have included a spreadsheet which runs through this approach for the 1979 Pirates. You can change the data in cells B2 to G2 to whatever you want to do this with other entities. Anyway, I show the LW generated by adding 10 of each event, 1 of each event, .1 of each event, etc. and the same for -10, -1, -.1, etc. I have highlighted in pink the positive and negative points at which the convergence, the limit, occurs. If you go past that(I put it at one ten-millionth, 10^-6), the values start fluctuating again. My suspicion is that this is because of the spreadsheet not having perfect accuracy, internal rounding and the like, but I could be wrong. Anyway, you can see there is not a lot of difference. The +10 weight for a Pirate single for instance is .4898824, the +1 is .4892998, and the limit is around .4892350. So you really don't need to do that, but it is nice to illustrate the property.

Added 4/7/04: Using calculus, you can figure this precisely using partial derivatives. The value of the single for instance is equal to the partial derivative of the BsR function with respect to singles. You can still do this even if you don't know calculus, because the math works out simple with BsR. The formula winds up being:

((B+C)*(A*b+B*a)-(A*B)*(b+c))/((B+C)^2)+d

Let A, B, C, and D be the respective total factors for the entity you are interested in. Let a, b, c, and d be the A, B, C, and D coefficients of the event you are interested in. That's it. Thank goodness all of the formulas for the pieces of BsR are linear.
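
As an illustration, this Python sketch applies the partial-derivative formula to the basic version, using the long-term averages scaled up to a hypothetical 6000 PA, and compares it with the "plus one" method. The per-event coefficients come from expanding (2TB-H-4HR+.05W)*.78; the names are mine.

```python
def bsr(A, B, C, D):
    return A * B / (B + C) + D


def linear_weight(A, B, C, a, b, c, d):
    """Partial derivative of BsR with respect to one event."""
    return ((B + C) * (A * b + B * a) - A * B * (b + c)) / (B + C) ** 2 + d


# (a, b, c, d) coefficients of each event in the basic version
EVENTS = {
    "S": (1, 0.78, 0, 0),
    "D": (1, 2.34, 0, 0),
    "T": (1, 3.90, 0, 0),
    "HR": (0, 2.34, 0, 1),   # a HR adds a hit but is subtracted from A
    "W": (1, 0.039, 0, 0),
    "O": (0, 0, 1, 0),       # O = AB - H
}

PA = 6000                    # hypothetical scale; only the ratios matter
A, B, C, D = 0.303 * PA, 0.308 * PA, 0.675 * PA, 0.0222 * PA


def plus_one(event, step=1.0):
    """Change in BsR per event after adding `step` of that event."""
    a, b, c, d = EVENTS[event]
    new = bsr(A + a * step, B + b * step, C + c * step, D + d * step)
    return (new - bsr(A, B, C, D)) / step


weights = {e: linear_weight(A, B, C, *co) for e, co in EVENTS.items()}
for e, lw in weights.items():
    print(e, round(lw, 3), round(plus_one(e), 3))
```

The two columns should agree to about three decimals, and both should land close to the LBsR coefficients given earlier.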

There is a spreadsheet linked at the bottom that shows this. It is based on Tango's full BsR which is available at:

http://www.tangotiger.net/bsrexpl.html

If you don't want to deal with a category, just set the coefficients to 0. You can change the coefficients for the other events to use any BsR equation you want all with this spreadsheet. Of course you can also change the "#" column, which is the frequency of the event for the dataset you're using. Enjoy.

Matching LW Values

Based on the formula above to calculate the Linear Weight value of a certain event using BsR, you can also fix the B coefficients so that they produce desired LWs. For example, on my LW page there is the ERP formula that I use, based on 1951-1998 composite major league data. Suppose I want to force my BsR formula to produce the same LW as are used in ERP. How do I go about doing this?

Well, first, I have to clearly define which events are in the A, C, and D factors, and what coefficient they have there. For my case, I will use S, D, T, HR, W, and O as the only events. S, D, T, and W each have a coefficient of 1 in A; O has a coefficient of 1 in C; and HR has a coefficient of 1 in D.

Now, I need to calculate the A, C, and D factors for the entity I am working with (in my case, all teams 1951-1998). Then, I use these to calculate what we will call ActB, the actual B value required for BsR to equal runs scored. The formula for ActB is (R-D)*C/(A-R+D), where R is the actual runs scored we want to match.

So, now we have everything we need. a, b, c, and d are still the coefficients for the given event in the respective factors, L is the linear weight we want the event to have, and B is the ActB just calculated. We can calculate b as:

b = ((B+C)^2*(L-d)-B^2*a-B*C*a+A*B*c)/(A*C)

Voila. So, let's look at my ERP equation. It is (TB+W+.5H-.3(AB-H))*.324, which has LW for S, D, T, HR, W, O of .486, .81, 1.134, 1.458, .324, -.0972. The B that I use for BsR ((2TB-H-4HR+.05W)*.78) is:

B = .78S+2.34D+3.9T+2.34HR+.039W

Now, with all of this data, we can force the LW values. When we do this (which you can do with the spreadsheet linked at the bottom of the page, the same one that gives the actual LW values), it seems to give a result that's decent to .001 or so. It might be rounding error, or it might be something else, but either way, it's pretty close. So, to match the linear weight values I wanted, my B would be:

B = .833S+2.360D+3.888T+2.159HR+.0692W-.010(O)

Yes, the outs have to be included as well. That's kind of cumbersome if you don't want outs in B, but it's necessary to force the values. Are you sufficiently confused yet? I am.
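
For anyone who would rather see it in code than in a spreadsheet, here is a minimal Python sketch of forcing the B coefficients. It uses the 1946-1995 per-PA averages from the top of the page as a stand-in for the entity, not the 1951-1998 data behind the numbers above, so the coefficients it produces will not match mine exactly. The function names are my own.

```python
def linear_weight(A, B, C, a, b, c, d):
    """LW of an event with coefficients a, b, c, d in the four factors."""
    return ((B + C) * (A * b + B * a) - A * B * (b + c)) / (B + C) ** 2 + d


def required_b(L, A, B, C, a, c, d):
    """B coefficient that makes the event's LW come out to L."""
    return ((B + C) ** 2 * (L - d) - B ** 2 * a - B * C * a + A * B * c) / (A * C)


# Stand-in entity: long-term per-PA averages (an assumption for illustration)
A, B, C = 0.303, 0.308, 0.675

# (a, c, d) coefficients for each event, and the ERP weights to match
FACTORS = {"S": (1, 0, 0), "D": (1, 0, 0), "T": (1, 0, 0),
           "HR": (0, 0, 1), "W": (1, 0, 0), "O": (0, 1, 0)}
TARGET = {"S": 0.486, "D": 0.81, "T": 1.134, "HR": 1.458, "W": 0.324, "O": -0.0972}

forced = {e: required_b(TARGET[e], A, B, C, *FACTORS[e]) for e in FACTORS}
for e, b in forced.items():
    a, c, d = FACTORS[e]
    print(e, round(b, 3), round(linear_weight(A, B, C, a, b, c, d), 4))
```

The last column printed should reproduce the target weights, which is the whole point of the exercise.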