Vandermonde’s Identity from the Generalized Product Rule

Vandermonde’s Identity, \displaystyle \sum_{k=0}^r \binom{n}{k} \binom{m}{r-k} = \binom{n+m}{r}, is one of the more famous identities involving the binomial coefficients.  A standard way to prove it is with the following combinatorial argument.

How many ways are there to choose a committee of size r from a group of n men and m women?  By definition of the binomial coefficient, the answer is \binom{n+m}{r}.  Alternatively, condition on the number k of men on the committee.  There are \binom{n}{k} ways to choose the men, and then there are \binom{m}{r-k} ways to choose the remaining r-k members of the committee from the m women.  Summing over all possible values of k gives the left side.

In this post we’re going to give a much less standard proof of Vandermonde’s Identity — one via the generalized product rule for derivatives.  In addition, while the combinatorial proof requires n and m to be natural numbers, our proof holds for all real numbers.

The generalized product rule, sometimes known as Leibniz’s Generalized Rule, gives a formula for the nth derivative of the product of two functions.  Here it is:

\displaystyle (fg)^{(n)}(x) = \sum_{k=0}^n \binom{n}{k} f^{(k)}(x) g^{(n-k)}(x).

This formula can be easily proved by induction.

Let f(x) = x^n, and let g(x) = x^m.  We know that the rth derivative of a power function like x^n is n^{\underline{r}} x^{n-r}, where n^{\underline{r}} is the falling factorial n(n-1) \cdots (n-r+1).  Now, let’s look at the rth derivative of f(x) g(x) = x^{n+m}.  As we just argued, this is (n+m)^{\underline{r}} x^{n+m-r}.  By the generalized product rule, then, we have

\displaystyle (n+m)^{\underline{r}} x^{n+m-r} = \sum_{k=0}^r \binom{r}{k} n^{\underline{k}} x^{n-k} m^{\underline{r-k}} x^{m-r+k}

\displaystyle \implies (n+m)^{\underline{r}} x^{n+m-r} = \sum_{k=0}^r \frac{r!}{(r-k)! k!} n^{\underline{k}} m^{\underline{r-k}} x^{n+m-r}

\displaystyle \implies \frac{(n+m)^{\underline{r}}}{r!} x^{n+m-r} = \left(\sum_{k=0}^r \frac{n^{\underline{k}}}{k!} \frac{m^{\underline{r-k}}}{(r-k)!}\right) x^{n+m-r}.

The coefficients of the two expressions here must be equal in order for the expressions themselves to be equal.  In addition, using the definition of the generalized binomial coefficient, we have \binom{n}{k} = \frac{n^{\underline{k}}}{k!}.  (This allows binomial coefficients to be defined for values of n that are not natural numbers.)  All of this gives us Vandermonde’s Identity,

\displaystyle \binom{n+m}{r} = \sum_{k=0}^r \binom{n}{k} \binom{m}{r-k}.

Posted in binomial coefficients, calculus, combinatorics | Leave a comment

Combinatorial Proofs for the Binomial and Alternating Binomial Power Sums

In this post I give combinatorial proofs of formulas for the binomial and alternating binomial power sums: \displaystyle \sum_{k=0}^n \binom{n}{k} k^m and \displaystyle \sum_{k=0}^n \binom{n}{k} k^m (-1)^k.

Here’s the first.

Identity 1.

\displaystyle \sum_{k=0}^n \binom{n}{k} k^m = \sum_{j=0}^m \left\{ {m \atop j} \right\} \binom{n}{j} j! 2^{n-j}.

Proof.  Both sides count the number of functions from \{1, 2, \ldots, m\} to subsets of \{1, 2, \ldots, n\}.  For the left side, condition on the number of elements in the codomain.   If there are k elements in the codomain, there are \binom{n}{k} ways to choose those k elements from the n possible.  Once the k elements in the codomain are chosen, there are k choices for f(1)k choices for f(2), and so forth, for a total of k^m choices for the function values from f(1) to f(m).  Summing over all possible values of k gives the left side.

For the right side, condition on the number of elements in the image.  If there are j elements in the image, there are \binom{n}{j} ways of choosing these j from the n possible.  Then there are 2^{n-j} ways to complete the codomain, as any of the remaining n-j elements in \{1, 2, \ldots, n\} could either be in the codomain or not.  In addition, there are \left\{ {m \atop j } \right\} j! onto functions from a set with m elements to a set with j elements.  Finally, sum over all possible values of j.    \square

And the second.

Identity 2.

\displaystyle \sum_{k=0}^n \binom{n}{k} k^m (-1)^k = (-1)^n \left\{ m \atop n \right\} n!.

Proof.  We use the same interpretation as in the first proof.  The left side is the difference between the number of functions from \{1, 2, \ldots, m\} to subsets of \{1, 2, \ldots, n\} with an even-sized codomain and the number of such functions with an odd-sized codomain.  We construct a sign-reversing involution on the set of functions from \{1, 2, \ldots, m\} to \{1, 2, \ldots n\}.

Given a function f: \{1, 2, \ldots, m\} \mapsto S, where S is a subset of \{1, 2, \ldots, n\}, let y be the largest-numbered element of \{1, 2, \ldots, n\} that is not in the image of f.  Define \sigma(f) to be the function g: \{1, 2, \ldots, m\} \mapsto T, where T \subseteq \{1, 2, \ldots, n\} such that f(x) = g(x) for all x \in \{1, 2, \ldots, m\} but that T = S \cup \{y\} if y \not\in S and T = S - \{y\} if y \in S.  In other words, f and g are the same function except that the codomain of exactly one of them contains the largest-numbered element not in their common image.  From this definition we can see that \sigma is both sign-reversing (as it maps functions with even-sized codomains to those with odd-sized codomains and vice versa) and an involution.  Thus the functions for which \sigma is defined are canceled out in the sum on the left.  The only functions for which \sigma is not defined are those for which y does not exist.  These are the onto functions from \{1, 2, \ldots, m\} to \{1, 2, \ldots, n\}.  There are \left\{ {m \atop n } \right\} n! of these, and their sign is determined by whether n is even or odd.  \square

For proofs of these two identities by using the formulas for \displaystyle \sum_{k=0}^n \binom{n}{k} k^{\underline{m}} and \displaystyle \sum_{k=0}^n \binom{n}{k} k^{\underline{m}} (-1)^k and then using the Stirling numbers of the second kind to convert from falling factorial powers to ordinary powers, see my paper [1].


[1]  Michael Z. Spivey.  “Combinatorial Sums and Finite Differences.”  Discrete Mathematics 307 (24): 3130-3146, 2007.

Posted in binomial coefficients, combinatorics, Stirling numbers | Leave a comment

Independence of the Range and Minimum of a Sample from an Exponential Distribution

A few years ago I answered a question on math.SE about the distribution of the sample range from an exponential (1) distribution.  In my answer I claim that the range and the minimum of the sample are independent, thanks to the memoryless property of the exponential distribution.  Here’s a direct derivation of the independence of the range and minimum, for those (including the graduate student who contacted me about this last month!) for whom the memoryless justification is not sufficient.

Let Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)} be the order statistics from a sample of size n from an exponential (1) distribution.  The joint distribution of Y_{(1)} and Y_{(n)}  is given by

\begin{aligned} f_{Y_{(1)}, Y_{(2)}}(y,z) &= \frac{n!}{(n-2)!} e^{-y} e^{-z} \left( (1 - e^{-z}) - (1-e^{-y}) \right)^{n-2} \\ &= n(n-1) e^{-y} e^{-z} (e^{-y} - e^{-z})^{n-2}, \text{ for } 0 < y < z. \end{aligned}

Now, let’s do a double variable transformation.  Let R = Y_{(n)} - Y_{(1)}; i.e., R is the sample range.  Let S = Y_{(1)}, the minimum.  Then Y_{(n)} = R + S and Y_{(1)} = S.  The Jacobian of the transformation is \begin{vmatrix} 1 & 1 \\ 0 & 1 \end{vmatrix} = 1.  Substituting, we have

\begin{aligned} f_{R,S}(r,s) &= n(n-1) e^{-s} e^{-r-s} (e^{-s} - e^{-r-s})^{n-2} \\ &= n(n-1) e^{-2s} e^{-r} (e^{-s})^{n-2} (1 - e^{-r})^{n-2} \\ &= (n-1) e^{-r} (1-e^{-r})^{n-2} n e^{-ns}, \text{ for } r, s > 0.\end{aligned}

Since f_{R,S}(r,s) factors into a function of r times a function of sR and S are independent.  (We have f_R(r) = (n-1) e^{-r} (1-e^{-r}), r > 0; and f_{Y_{(1)}}(y) = n e^{-ny}, y > 0.)

Posted in probability, statistics | Leave a comment

A Double Euler-Mascheroni Sequence

What is the value of the sum \displaystyle \sum_{i=1}^n \sum_{j=1}^n \frac{1}{i+j}?  Furdui poses this as Question 1.28 in his book Limits, Series, and Fractional Part Integrals [1, p. 5], where he calls the sum a “double Euler-Mascheroni sequence.”  His solution converts the double sum into an expression containing an integral and an infinite series, and then he evaluates those.  In this post I’ll give a different derivation of Furdui’s expression for the value of the double sum – one that uses only standard summation manipulations.

Let \displaystyle H_n denote the nth harmonic number: \displaystyle H_n = \sum_{k=1}^n \frac{1}{k}.  The following lemma is proved in Concrete Mathematics [2, pp. 40-41] and is a straightforward application of summation by parts, but I’ll give a proof here anyway.

Lemma: \displaystyle \sum_{k=1}^n H_k = (n+1) H_{n+1} - (n+1).

Proof:  Swapping the order of summation, we have \displaystyle \sum_{k=1}^n H_k = \sum_{k=1}^n \sum_{j=1}^k \frac{1}{j} = \sum_{j=1}^n \sum_{k=j}^n \frac{1}{j} = \sum_{j=1}^n \frac{n+1-j}{j} = (n+1) \sum_{j=1}^n \frac{1}{j} - \sum_{j=1}^n 1 \\ = (n+1) H_n - n = (n+1) H_n - n + \frac{n+1}{n+1} - 1 = (n+1)H_{n+1} - (n+1).

With the lemma in place, we’re now ready to derive an expression for this double Euler-Mascheroni sequence.

Theorem: \displaystyle \sum_{i=1}^n \sum_{j=1}^n \frac{1}{i+j} = (2n+1) H_{2n+1} - (2n+2) H_{n+1} + 1.

Proof: We have \displaystyle \sum_{i=1}^n \sum_{j=1}^n \frac{1}{i+j} = \sum_{i=1}^n (H_{n+i} - H_i) = \sum_{i=1}^n H_{n+i} - \sum_{i=1}^n H_i = \sum_{i=1}^{2n} H_i - 2 \sum_{i=1}^n H_i.

Applying the lemma, we obtain \displaystyle \sum_{i=1}^n \sum_{j=1}^n \frac{1}{i+j} = (2n+1) H_{2n+1} - (2n+1) - 2 \left( (n+1)H_{n+1} - (n+1)\right) \\ = (2n+1) H_{2n+1} - (2n+2)H_{n+1} + 1.


  1. Ovidiu Furdui, Limits, Series, and Fractional Part Integrals, Springer, 2013.
  2. Ronald L. Graham, Donald E. Knuth, and Oren Patashnik, Concrete Mathematics (2nd ed.), Addison-Wesley, 1994.
Posted in harmonic numbers | Leave a comment

Picking Random Points on a Circle (Not a Disk)

How do you select points at random on a circle?  By “circle” I mean the outside of a disk, not its interior.  In this post I’m going to discuss two methods: (1) Selecting an angle at random on [0, 2\pi] and taking the point that corresponds to that angle, and (2) Selecting an x-coordinate at random and taking the point that corresponds to that x-value.  (Actually, there are two possible points in Case (2), but we can subsequently choose either of those with a 50% probability.)  My main purpose is to show that the second method does not actually select points at random on the circle; instead, that method chooses points more likely to be near the top and bottom of the circle than near the sides.

First, we have to ask what it means, in a precise mathematical sense, to pick points at random on a circle.  I’m going to take this to mean the following:

A process that selects a point at random on a circle is one such that the probability of picking a point from an arc of the circle is proportional to the length of that arc.

For example, there should be a probability of 1/4 of picking a point from the upper-right quadrant of the circle, a probability of 1/2\pi of picking a point from an arc of length 1, and in general a probability of L/2\pi of picking a point from an arc of length L.

For simplicity’s sake, let’s use a unit circle.  Suppose we pick an angle at random from [0, 2\pi] and take the corresponding point on the circle.  The probability of selecting an angle from a particular interval of length L is L/2\pi.  Since the length of an arc of the unit circle is just \theta, where \theta is the angle subtended by that arc, the probability of this process selecting a point from an arc of length L is also L/2\pi.  Thus selecting an angle at random and taking the point that corresponds to that angle is a process that selects a random point on the circle.

Why doesn’t selecting an x-coordinate do the same thing?  Let’s focus on just the upper-right quarter of the circle and select an x value at random from [0,1].  Then we take the point (x, \sqrt{1-x^2}) on the circle.  Since x is related to \theta by the relationship x = \cos \theta, and \cos \theta is decreasing on [0, \pi/2], the probability of the point (x, \sqrt{1-x^2}) ending up in the range [0, \pi/4] of \theta values is the probability that x is chosen from the interval [ \cos(\pi/4), \cos(0)] = [\sqrt{2}/2, 1].  This probability is not 1/2 (as it would be if this process selected a point at random on the unit quarter-circle) but 1 - \sqrt{2}/{2} \approx 0.292.  Thus choosing a random x from [0,1] and taking the corresponding point on the unit quarter circle tends to produce more points near the top of the circle than near the side.

Generalizing from the example, the probability that the x-coordinate procedure produces a point on the circle with an angle in the range [0, \theta] is 1 - \cos \theta.  This means that the probability of the x-coordinate procedure yielding a value in a small range of width \Delta \theta around a fixed \theta value is \frac{d}{d \theta} (1 - \cos \theta) = \sin \theta times \Delta \theta, or \sin \theta \Delta \theta.  Since the sine function increases on [0, \pi/2], the closer \theta is to \pi/2, the more likely it is that the x-coordinate procedure will produce a value near \theta.

Posted in probability, simulation, statistics | 2 Comments

Proof of the Recursive Identity for the Bernoulli Numbers

The Bernoulli numbers B_n satisfy the recursive relationship \displaystyle \sum_{k=0}^n \binom{n+1}{k} B_k = [n=0].  (Here, [P] is the Iverson bracket, where [P] evaluates to 1 if P is true and 0 if P is false.)  The relationship can be used to calculate the Bernoulli numbers fairly easily.  This post gives a proof of this relationship.

The exponential generating function for the Bernoulli numbers is given by \displaystyle \frac{x}{e^x-1} = \sum_{n=0}^{\infty} B_n \frac{x^n}{n!}.  The exponential generating function for the sequence consisting of 0 followed by an infinite number of 1’s is given by \displaystyle e^x-1 = \sum_{n=1}^{\infty} \frac{x^n}{n!}.  Multiplying these together, we obtain, thanks to the multiplication principle for exponential generating functions (see, for example, Concrete Mathematics [1], p. 365)

\displaystyle x = \left(\sum_{n=1}^{\infty} \frac{x^n}{n!} \right) \left( \sum_{n=0}^{\infty} B_n \frac{x^n}{n!}\right) = \sum_{n=0}^{\infty} \left(\sum_{k=0}^{n-1} \binom{n}{k} B_k \right) \frac{x^n}{n!}.

But x has generating function \displaystyle \sum_{n=0}^{\infty} [n=1] \frac{x^n}{n!}.  Thus we have \displaystyle \sum_{k=0}^{n-1} \binom{n}{k} B_k = [n=1].  Replacing n-1 with n, we have the formula we were trying to prove:

\displaystyle \sum_{k=0}^n \binom{n+1}{k} B_k = [n=0].

(As a side note, the Concrete Mathematics reference, p. 365, derives the exponential generating function for the Bernoulli numbers from the recurrence relation.)


1.  Ronald L. Graham, Donald E. Knuth, and Oren Patashnik.  Concrete Mathematics, 2nd edition.  Addison-Wesley, 1994.

Posted in Bernoulli numbers, real analysis, recurrence relations | 1 Comment

Does Pointwise Convergence to a Continuous Function Imply Uniform Convergence?

Recently in my advanced calculus class we discussed how the uniform limit of a sequence of continuous functions is itself continuous.  One of my students turned the question around: If the limit of a pointwise sequence of continuous functions is continuous, does that mean that the convergence is uniform?

The answer is “No,” although I couldn’t think of a counterexample off the top of my head.  I found one (not surprisingly) on Math.SE.  Although there are some good discussion and multiple good answers to this question given there,  I’m going to discuss Ilmari Karonen’s answer.

Let \displaystyle f_n: [0,1] \mapsto \mathbb{R} be given by

\displaystyle f_n(x) = \begin{cases} nx, & 0 \leq x \leq 1/n; \\ 2 - nx, & 1/n < x \leq 2/n; \\ 0, & \text{ otherwise}. \end{cases}

Each f_n is continuous (piecewise, but that’s fine).  For any x \in (0,1], there exists N such that x > 2/n for all n \geq N.  Since f_n(0) = 0 for all n, this means that f_n(x) \to 0 for all x \in [0,1].  The zero function is clearly continuous.

However, for each n \in \mathbb{N} we have that f_n(1/n) = 1.  This means that when \epsilon < 1 there cannot be an N \in \mathbb{N} such that |f_n(x)| < \epsilon for all x \in [0,1] and n \geq N.  Thus the convergence cannot be uniform.

Posted in real analysis | 2 Comments