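Learning Statistics with Python (2)

This material is based on Yamada, Sugisawa, and Murai (2008), 「R」によるやさしい統計学.

This lecture covers descriptive statistics for two variables: given two variables, we describe the relationship between them. In particular, we look at the relationship between two quantitative variables and between two qualitative variables.

Topics covered:

Relationships between two variables
Showing the relationship between quantitative variables (correlation) graphically: scatter plots
Expressing the relationship between quantitative variables (correlation) numerically: covariance and the correlation coefficient
Expressing the relationship between qualitative variables (association): cross tabulation and the phi coefficient
Summary of functions
Exercises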

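Relationships between two variables

Recall that the data handled in statistics fall into two broad types:

Quantitative data (also called numerical data)

Quantitative data represent the amount or size of something and are expressed as numbers. They can be divided into the following two kinds.
- Continuous data (also called measurement data)
  These take continuous values. For example, the length or weight of a product is expressed as a real number, so it belongs here.
- Discrete data (also called count data)
  These take separated values, such as counts, and are expressed as integers.

Qualitative data (also called categorical data)

Qualitative data represent the nature or kind of something and are expressed with words or symbols.

For example, scores on a mathematics test or a physics test are quantitative data. For the students in a class, each student's mathematics score can be regarded as one concrete value of the variable "mathematics test score". Likewise, each student's physics score is one concrete value of the variable "physics test score". Such variables are called quantitative variables, and the relationship between two quantitative variables is called correlation.

On the other hand, data on whether each student in a class likes or dislikes mathematics are qualitative data, and can be regarded as concrete values of the variable "liking for mathematics". Such a variable is called a qualitative variable, and the relationship between two qualitative variables, for example "liking for mathematics" and "liking for physics", is called association.

This lecture covers both correlation and association.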

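Showing the relationship between quantitative variables (correlation) graphically: scatter plots

To examine the relationship between two variables, for example the same person's mathematics test score and English test score, it is common to start by drawing a figure.

A scatter plot is a figure obtained by plotting paired data, such as a student's scores in two subjects, as points on a two-dimensional plane. It is used to examine the relationship between the two sets of data. For example, suppose the mathematics and English scores of a class are as follows. The scores are listed in order of student ID number, so the student with ID number 3 scored 14 in mathematics and 12 in English.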

Math = np.array([17, 13, 14, 7, 12, 10, 6, 8, 15, 4, 14, 9, 6, 10, 12, 5, 12, 8, 8, 12, 15, 18])
Eng  = np.array([14, 10, 12, 9, 10, 12, 1, 6, 16, 1, 12, 13, 11, 11, 16, 8, 11, 12, 6, 14, 17, 20])

This scatter plot can be drawn with the scatter function of the matplotlib.pyplot module, as follows. (The plot function also works, but it needs an extra option.)

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Math = np.array([17, 13, 14, 7, 12, 10, 6, 8, 15, 4, 14, 9, 6, 10, 12, 5, 12, 8, 8, 12, 15, 18])
Eng = np.array([14, 10, 12, 9, 10, 12, 1, 6, 16, 1, 12, 13, 11, 11, 16, 8, 11, 12, 6, 14, 17, 20])
plt.scatter(Math, Eng)
plt.xlabel('Math')
plt.ylabel('Eng')


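plt.plot(Math, Eng,'o')  # without a format argument such as 'o', the points are joined by lines
plt.xlabel('Math')
plt.ylabel('Eng')
# The default display range fits the data very tightly, so it is better to set the range with the axis function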
plt.axis([2,20,0,25])

[2, 20, 0, 25]

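Here each student's scores in the two subjects are plotted with the mathematics score as the x coordinate and the English score as the y coordinate. The plot suggests that the mathematics (x) and English (y) scores are related: y tends to increase as x increases. In this case x and y are said to have a positive correlation. Conversely, when y tends to decrease as x increases, x and y are said to have a negative correlation. When there is no such relationship, x and y are said to be uncorrelated:

Positive correlation: the larger x is, the larger y tends to be. When one increases, the other also tends to increase.
Negative correlation: the larger x is, the smaller y tends to be. When one increases, the other tends to decrease.
No correlation: changes in the size of x are unrelated to changes in the size of y.

In the figure below, panels (1) and (2) are scatter plots that can be regarded as showing correlation, and panel (3) is one that can be regarded as showing no correlation. The straight line is y = x; the more points lie along it, the stronger the positive correlation. The ellipse is drawn so that it contains most of the data. In the weak correlation of (2) the points are more spread out than in (1), and in the uncorrelated case the ellipse covering the data is nearly a circle.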
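
The three situations described above can be reproduced with synthetic data. The following is a minimal sketch, assuming normally distributed random data; the variable names (y_strong, y_weak, y_none), the noise levels, and the random seed are illustrative choices and not part of the original material, and the reference line and ellipses are not drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration, not data from the lecture)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_strong = x + 0.3 * rng.normal(size=200)  # small noise: strong positive correlation
y_weak = x + 2.0 * rng.normal(size=200)    # large noise: weak positive correlation
y_none = rng.normal(size=200)              # generated independently of x: no correlation

# One panel per case, sharing the x axis so the spreads can be compared
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True)
panels = [("(1) strong correlation", y_strong),
          ("(2) weak correlation", y_weak),
          ("(3) no correlation", y_none)]
for ax, (title, y) in zip(axes, panels):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
fig.tight_layout()
```

In a notebook with %matplotlib inline the figure is displayed automatically; in a plain script, add plt.show() at the end.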

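Task 2-1

The data below give the heights of parents and their children (in cm). Draw a scatter plot and state whether the parents' and children's heights are correlated.

Parent = np.array([175, 170, 165, 160, 182, 177, 160, 176, 161, 170, 172])
Child  = np.array([172, 173, 170, 168, 177, 172, 171, 172, 162, 167, 172])

[Hint] A scatter plot can be drawn with the scatter function. The strength of the correlation is how closely the plotted points follow a straight line y = ax + b (a and b are constants).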

import numpy as np
Parent = np.array([175, 170, 165, 160, 182, 177, 160, 176, 161, 170, 172])
Child  = np.array([172, 173, 170, 168, 177, 172, 171, 172, 162, 167, 172])

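Expressing the relationship between quantitative variables (correlation) numerically: covariance and the correlation coefficient

A deviation is the difference between a data value and the mean. The covariance s_xy of two variables x and y is the mean of the products of the deviations of x and y, and is given by the following formula (where $\bar{x}$ is the mean of x, $\bar{y}$ is the mean of y, and n is the number of data points):
$$ s_{xy} = \frac{ (x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \ldots + (x_n - \bar{x})(y_n - \bar{y})}{n}$$
In Python, the covariance can be obtained with the cov function of the numpy module. (Note: cov returns the variance-covariance matrix, and by default it divides by n - 1 rather than n; pass bias=True to divide by n as in the formula above.)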
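
As a check on the definition, the following sketch computes the covariance of the Math and Eng scores directly from the formula and compares it with np.cov. The variable names (dev_math, dev_eng, s_xy, cov_n, cov_n1) are illustrative; bias=True is the np.cov option that divides by n.

```python
import numpy as np

Math = np.array([17, 13, 14, 7, 12, 10, 6, 8, 15, 4, 14, 9, 6, 10, 12, 5, 12, 8, 8, 12, 15, 18])
Eng = np.array([14, 10, 12, 9, 10, 12, 1, 6, 16, 1, 12, 13, 11, 11, 16, 8, 11, 12, 6, 14, 17, 20])

n = len(Math)
dev_math = Math - Math.mean()   # deviations of x from its mean
dev_eng = Eng - Eng.mean()      # deviations of y from its mean

s_xy = np.sum(dev_math * dev_eng) / n        # covariance as defined above (divide by n)
cov_n = np.cov(Math, Eng, bias=True)[0, 1]   # np.cov with division by n
cov_n1 = np.cov(Math, Eng)[0, 1]             # np.cov default: division by n - 1

print(s_xy, cov_n, cov_n1)
```

The first two values agree, and the third is larger by the factor n / (n - 1).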

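Task 2-2

For the parent-child height data of Task 2-1, compute the covariance. Also compute the covariance after converting the heights to meters (m), and compare the two values.

[Hint] The variance-covariance matrix of variables x and y can be obtained with np.cov(x, y). Both the Parent and Child data are in cm; to convert them to m, divide by 100.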

np.cov(Parent,Child)[0,1]

20.709090909090914

np.cov(Parent*0.01,Child*0.01)[0,1]

0.0020709090909090903
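
The factor of $10^{-4}$ between the two results follows directly from the definition of the covariance. As a short derivation, with constants a and b (symbols introduced here for illustration) rescaling x and y:

$$ s_{(ax)(by)} = \frac{1}{n}\sum_{i=1}^{n} (a x_i - a\bar{x})(b y_i - b\bar{y}) = ab \cdot \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = ab \, s_{xy} $$

With a = b = 0.01 (cm to m) the covariance is multiplied by $10^{-4}$, which is why 20.709... becomes 0.0020709.... The same scaling holds when the sum is divided by n - 1, as np.cov does by default.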

help(np.cov)

Help on function cov in module numpy.lib.function_base:

cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
    Estimate a covariance matrix, given data and weights.
    
    Covariance indicates the level to which two variables vary together.
    If we examine N-dimensional samples, :math:`X = [x_1, x_2, ... x_N]^T`,
    then the covariance matrix element :math:`C_{ij}` is the covariance of
    :math:`x_i` and :math:`x_j`. The element :math:`C_{ii}` is the variance
    of :math:`x_i`.
    
    See the notes for an outline of the algorithm.
    
    Parameters
    ----------
    m : array_like
        A 1-D or 2-D array containing multiple variables and observations.
        Each row of `m` represents a variable, and each column a single
        observation of all those variables. Also see `rowvar` below.
    y : array_like, optional
        An additional set of variables and observations. `y` has the same form
        as that of `m`.
    rowvar : bool, optional
        If `rowvar` is True (default), then each row represents a
        variable, with observations in the columns. Otherwise, the relationship
        is transposed: each column represents a variable, while the rows
        contain observations.
    bias : bool, optional
        Default normalization (False) is by ``(N - 1)``, where ``N`` is the
        number of observations given (unbiased estimate). If `bias` is True, then
        normalization is by ``N``. These values can be overridden by using the
        keyword ``ddof`` in numpy versions >= 1.5.
    ddof : int, optional
        If not ``None`` the default value implied by `bias` is overridden.
        Note that ``ddof=1`` will return the unbiased estimate, even if both
        `fweights` and `aweights` are specified, and ``ddof=0`` will return
        the simple average. See the notes for the details. The default value
        is ``None``.
    
        .. versionadded:: 1.5
    fweights : array_like, int, optional
        1-D array of integer freguency weights; the number of times each
        observation vector should be repeated.
    
        .. versionadded:: 1.10
    aweights : array_like, optional
        1-D array of observation vector weights. These relative weights are
        typically large for observations considered "important" and smaller for
        observations considered less "important". If ``ddof=0`` the array of
        weights can be used to assign probabilities to observation vectors.
    
        .. versionadded:: 1.10
    
    Returns
    -------
    out : ndarray
        The covariance matrix of the variables.
    
    See Also
    --------
    corrcoef : Normalized covariance matrix
    
    Notes
    -----
    Assume that the observations are in the columns of the observation
    array `m` and let ``f = fweights`` and ``a = aweights`` for brevity. The
    steps to compute the weighted covariance are as follows::
    
        >>> w = f * a
        >>> v1 = np.sum(w)
        >>> v2 = np.sum(w * a)
        >>> m -= np.sum(m * w, axis=1, keepdims=True) / v1
        >>> cov = np.dot(m * w, m.T) * v1 / (v1**2 - ddof * v2)
    
    Note that when ``a == 1``, the normalization factor
    ``v1 / (v1**2 - ddof * v2)`` goes over to ``1 / (np.sum(f) - ddof)``
    as it should.
    
    Examples
    --------
    Consider two variables, :math:`x_0` and :math:`x_1`, which
    correlate perfectly, but in opposite directions:
    
    >>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
    >>> x
    array([[0, 1, 2],
           [2, 1, 0]])
    
    Note how :math:`x_0` increases while :math:`x_1` decreases. The covariance
    matrix shows this clearly:
    
    >>> np.cov(x)
    array([[ 1., -1.],
           [-1.,  1.]])
    
    Note that element :math:`C_{0,1}`, which shows the correlation between
    :math:`x_0` and :math:`x_1`, is negative.
    
    Further, note how `x` and `y` are combined:
    
    >>> x = [-2.1, -1,  4.3]
    >>> y = [3,  1.1,  0.12]
    >>> X = np.vstack((x,y))
    >>> print(np.cov(X))
    [[ 11.71        -4.286     ]
     [ -4.286        2.14413333]]
    >>> print(np.cov(x, y))
    [[ 11.71        -4.286     ]
     [ -4.286        2.14413333]]
    >>> print(np.cov(x))
    11.71

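Correlation coefficient

A scatter plot is useful for examining the relationship between two sets of data, but it is hard to judge from a graph alone whether they are correlated or uncorrelated. We therefore want a numerical index of the relationship. The covariance is unsuitable as that index because, as Task 2-2 showed, its value changes greatly depending on, for example, whether the unit is m or cm. The correlation coefficient, which expresses the degree of correlation as a number that does not depend on the units, is used instead. The most common one is Pearson's correlation coefficient, defined as follows (s_xy is the covariance of x and y, and s_x and s_y are the standard deviations of x and y):
$$ r_{xy} = \frac{s_{xy}}{s_x s_y} $$

The correlation coefficient r takes values in the range -1 ≤ r ≤ 1. The strength of the correlation is judged from the value of r according to the table of r ranges at the end of this document. For the mathematics and English scores above, np.corrcoef gives: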

np.corrcoef(Math,Eng)

array([[ 1.        ,  0.78822657],
       [ 0.78822657,  1.        ]])

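The correlation coefficient between the mathematics and English scores is 0.79. According to the table of r ranges (0.7 < r ≤ 1.0), this indicates a strong correlation.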
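
The definition of r can be checked against np.corrcoef. The sketch below computes r for the Math and Eng scores from the covariance and the standard deviations; the names r_n and r_n1 are illustrative. It also shows that using n or n - 1 consistently in both the numerator and the denominator gives the same value, because the factor cancels.

```python
import numpy as np

Math = np.array([17, 13, 14, 7, 12, 10, 6, 8, 15, 4, 14, 9, 6, 10, 12, 5, 12, 8, 8, 12, 15, 18])
Eng = np.array([14, 10, 12, 9, 10, 12, 1, 6, 16, 1, 12, 13, 11, 11, 16, 8, 11, 12, 6, 14, 17, 20])

# Dividing by n everywhere (np.std uses ddof=0 by default)
r_n = np.cov(Math, Eng, bias=True)[0, 1] / (np.std(Math) * np.std(Eng))

# Dividing by n - 1 everywhere (unbiased covariance and unbiased variances)
r_n1 = np.cov(Math, Eng)[0, 1] / (np.std(Math, ddof=1) * np.std(Eng, ddof=1))

print(r_n, r_n1, np.corrcoef(Math, Eng)[0, 1])
```

Mixing the two conventions, for example the default np.cov with the default np.std, would give a value that is too large by the factor n / (n - 1).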

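When the correlation is this strong, we can consider expressing the relationship between y and x as a linear equation y = a*x + b (a and b are constants). What we need are the constants a, the slope, and b, the intercept. In Python they can be obtained with the polyfit function of the numpy module (pass 1 as the third argument because we want a first-degree polynomial).

np.polyfit( Math,Eng,1)  # fit a first-degree polynomial (a straight line)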

array([ 0.93603919,  1.00139958])

help(np.polyfit)

Help on function polyfit in module numpy.lib.polynomial:

polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
    Least squares polynomial fit.
    
    Fit a polynomial ``p(x) = p[0] * x**deg + ... + p[deg]`` of degree `deg`
    to points `(x, y)`. Returns a vector of coefficients `p` that minimises
    the squared error.
    
    Parameters
    ----------
    x : array_like, shape (M,)
        x-coordinates of the M sample points ``(x[i], y[i])``.
    y : array_like, shape (M,) or (M, K)
        y-coordinates of the sample points. Several data sets of sample
        points sharing the same x-coordinates can be fitted at once by
        passing in a 2D-array that contains one dataset per column.
    deg : int
        Degree of the fitting polynomial
    rcond : float, optional
        Relative condition number of the fit. Singular values smaller than
        this relative to the largest singular value will be ignored. The
        default value is len(x)*eps, where eps is the relative precision of
        the float type, about 2e-16 in most cases.
    full : bool, optional
        Switch determining nature of return value. When it is False (the
        default) just the coefficients are returned, when True diagnostic
        information from the singular value decomposition is also returned.
    w : array_like, shape (M,), optional
        Weights to apply to the y-coordinates of the sample points. For
        gaussian uncertainties, use 1/sigma (not 1/sigma**2).
    cov : bool, optional
        Return the estimate and the covariance matrix of the estimate
        If full is True, then cov is not returned.
    
    Returns
    -------
    p : ndarray, shape (M,) or (M, K)
        Polynomial coefficients, highest power first.  If `y` was 2-D, the
        coefficients for `k`-th data set are in ``p[:,k]``.
    
    residuals, rank, singular_values, rcond :
        Present only if `full` = True.  Residuals of the least-squares fit,
        the effective rank of the scaled Vandermonde coefficient matrix,
        its singular values, and the specified value of `rcond`. For more
        details, see `linalg.lstsq`.
    
    V : ndarray, shape (M,M) or (M,M,K)
        Present only if `full` = False and `cov`=True.  The covariance
        matrix of the polynomial coefficient estimates.  The diagonal of
        this matrix are the variance estimates for each coefficient.  If y
        is a 2-D array, then the covariance matrix for the `k`-th data set
        are in ``V[:,:,k]``
    
    
    Warns
    -----
    RankWarning
        The rank of the coefficient matrix in the least-squares fit is
        deficient. The warning is only raised if `full` = False.
    
        The warnings can be turned off by
    
        >>> import warnings
        >>> warnings.simplefilter('ignore', np.RankWarning)
    
    See Also
    --------
    polyval : Compute polynomial values.
    linalg.lstsq : Computes a least-squares fit.
    scipy.interpolate.UnivariateSpline : Computes spline fits.
    
    Notes
    -----
    The solution minimizes the squared error
    
    .. math ::
        E = \sum_{j=0}^k |p(x_j) - y_j|^2
    
    in the equations::
    
        x[0]**n * p[0] + ... + x[0] * p[n-1] + p[n] = y[0]
        x[1]**n * p[0] + ... + x[1] * p[n-1] + p[n] = y[1]
        ...
        x[k]**n * p[0] + ... + x[k] * p[n-1] + p[n] = y[k]
    
    The coefficient matrix of the coefficients `p` is a Vandermonde matrix.
    
    `polyfit` issues a `RankWarning` when the least-squares fit is badly
    conditioned. This implies that the best fit is not well-defined due
    to numerical error. The results may be improved by lowering the polynomial
    degree or by replacing `x` by `x` - `x`.mean(). The `rcond` parameter
    can also be set to a value smaller than its default, but the resulting
    fit may be spurious: including contributions from the small singular
    values can add numerical noise to the result.
    
    Note that fitting polynomial coefficients is inherently badly conditioned
    when the degree of the polynomial is large or the interval of sample points
    is badly centered. The quality of the fit should always be checked in these
    cases. When polynomial fits are not satisfactory, splines may be a good
    alternative.
    
    References
    ----------
    .. [1] Wikipedia, "Curve fitting",
           http://en.wikipedia.org/wiki/Curve_fitting
    .. [2] Wikipedia, "Polynomial interpolation",
           http://en.wikipedia.org/wiki/Polynomial_interpolation
    
    Examples
    --------
    >>> x = np.array([0.0, 1.0, 2.0, 3.0,  4.0,  5.0])
    >>> y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
    >>> z = np.polyfit(x, y, 3)
    >>> z
    array([ 0.08703704, -0.81349206,  1.69312169, -0.03968254])
    
    It is convenient to use `poly1d` objects for dealing with polynomials:
    
    >>> p = np.poly1d(z)
    >>> p(0.5)
    0.6143849206349179
    >>> p(3.5)
    -0.34732142857143039
    >>> p(10)
    22.579365079365115
    
    High-order polynomials may oscillate wildly:
    
    >>> p30 = np.poly1d(np.polyfit(x, y, 30))
    /... RankWarning: Polyfit may be poorly conditioned...
    >>> p30(4)
    -0.80000000000000204
    >>> p30(5)
    -0.99999999999999445
    >>> p30(4.5)
    -0.10547061179440398
    
    Illustration:
    
    >>> import matplotlib.pyplot as plt
    >>> xp = np.linspace(-2, 6, 100)
    >>> _ = plt.plot(x, y, '.', xp, p(xp), '-', xp, p30(xp), '--')
    >>> plt.ylim(-2,2)
    (-2, 2)
    >>> plt.show()

This result suggests that the data can be described by the equation Eng = 0.936 * Math + 1.001. This kind of analysis is called simple regression analysis.
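
For a straight-line fit, the least-squares coefficients have a closed form: the slope is the covariance divided by the variance of x, and the intercept follows from the two means. The sketch below (the names a, b, and line are illustrative) checks this against np.polyfit and uses np.poly1d, which appears in the polyfit help text, to evaluate the fitted line; the Math score of 10 is an arbitrary example input.

```python
import numpy as np

Math = np.array([17, 13, 14, 7, 12, 10, 6, 8, 15, 4, 14, 9, 6, 10, 12, 5, 12, 8, 8, 12, 15, 18])
Eng = np.array([14, 10, 12, 9, 10, 12, 1, 6, 16, 1, 12, 13, 11, 11, 16, 8, 11, 12, 6, 14, 17, 20])

# Closed-form least-squares line: slope = s_xy / s_x^2, intercept = mean(y) - slope * mean(x)
a = np.cov(Math, Eng, bias=True)[0, 1] / np.var(Math)
b = Eng.mean() - a * Math.mean()

lm = np.polyfit(Math, Eng, 1)
print(a, b)        # should agree with lm[0] and lm[1]

line = np.poly1d(lm)   # callable polynomial built from the fitted coefficients
print(line(10))        # predicted Eng score for a Math score of 10
```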

Let us overlay this line on the earlier scatter plot.

lm = np.polyfit(Math, Eng, 1)

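plt.plot(Math, Eng,'o')  # without a format argument such as 'o', the points are joined by lines
plt.xlabel('Math')
plt.ylabel('Eng')
# The default display range fits the data very tightly, so it is better to set the range with the axis function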
plt.axis([2,20,0,25])

x = np.linspace(2.0, 30.0, 10000)
plt.plot(x, lm[0]*x+lm[1],"g")


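Task 2-3

Using the parent and child height data from Task 2-1, compute the correlation coefficient and state whether the two are correlated. Also draw the scatter plot together with the graph of the linear equation obtained by simple regression analysis.

Task 2-4

Using the data of Task 2-1, verify that the value returned by the corrcoef function equals the covariance given by the cov function divided by the product of the standard deviations given by the std function (the square root of the sample variance). Also find the correlation coefficient for the Task 2-1 data when the unbiased covariance and the square roots of the unbiased variances are used instead. (Note that np.cov divides by n - 1 by default while np.std divides by n by default, so the two must be given matching normalization, for example bias=True for np.cov or ddof=1 for np.std.)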

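Expressing the relationship between qualitative variables (association): cross tabulation and the phi coefficient

A cross tabulation is used to examine the relationship between qualitative variables. For example, whether liking or disliking mathematics is associated with liking or disliking statistics is examined by making a cross tabulation, because both variables (Math and Stat) are qualitative.

Let the values of the Math and Stat variables be as follows, where 好き means "like" and 嫌い means "dislike":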

import numpy as np
Math = np.array(["嫌い","嫌い","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き","嫌い",
                 "好き","嫌い","嫌い","好き","嫌い","嫌い","嫌い","嫌い"])
Stat = np.array(["好き","好き","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き",
                 "好き","嫌い","好き","嫌い","嫌い","嫌い","嫌い","嫌い"])

To make a cross tabulation, use the crosstab function of the pandas module:

import pandas as pd
data = pd.DataFrame({'Math':Math, 'Stat':Stat})
pd.crosstab(data.Math,data.Stat,margins=True)

help(pd.crosstab)

Help on function crosstab in module pandas.tools.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
    Compute a simple cross-tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns
    values : array-like, optional
        Array of values to aggregate according to the factors
    aggfunc : function, optional
        If no values array is passed, computes a frequency table
    rownames : sequence, default None
        If passed, must match number of row arrays passed
    colnames : sequence, default None
        If passed, must match number of column arrays passed
    margins : boolean, default False
        Add row/column margins (subtotals)
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified
    
    Examples
    --------
    >>> a
    array([foo, foo, foo, foo, bar, bar,
           bar, bar, foo, foo, foo], dtype=object)
    >>> b
    array([one, one, one, two, one, one,
           one, two, two, two, one], dtype=object)
    >>> c
    array([dull, dull, shiny, dull, dull, shiny,
           shiny, dull, shiny, shiny, shiny], dtype=object)
    
    >>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b    one          two
    c    dull  shiny  dull  shiny
    a
    bar  1     2      1     0
    foo  2     2      1     2
    
    Returns
    -------
    crosstab : DataFrame

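Here we introduce the phi coefficient, a special correlation coefficient for qualitative variables. The phi coefficient is the correlation coefficient computed for variables that take only the values 1 and 0 (binary variables). To apply it to qualitative variables, the data must therefore first be converted to 0 and 1. Let us do this for the Math and Stat variables above and compute the phi coefficient.

First, the variables can be converted to binary with a list comprehension, as shown below.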

MathDigitize = np.array([1 if x == "好き" else 0 for x in Math])
print(MathDigitize)
StatDigitize = np.array([1 if x == "好き" else 0  for x in Stat])
print(StatDigitize)

[0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0]
[1 1 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0]
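
Because Math and Stat are NumPy arrays, the same conversion can also be written without an explicit loop. As an alternative sketch, comparing the array with "好き" gives a boolean array, and astype(int) turns True/False into 1/0; the names math_bin and stat_bin are illustrative.

```python
import numpy as np

Math = np.array(["嫌い","嫌い","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き","嫌い",
                 "好き","嫌い","嫌い","好き","嫌い","嫌い","嫌い","嫌い"])
Stat = np.array(["好き","好き","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き",
                 "好き","嫌い","好き","嫌い","嫌い","嫌い","嫌い","嫌い"])

# Element-wise comparison, then booleans converted to integers (True -> 1, False -> 0)
math_bin = (Math == "好き").astype(int)
stat_bin = (Stat == "好き").astype(int)
print(math_bin)
print(stat_bin)
```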

Now that the values are numeric, applying the corrcoef function gives the phi coefficient:

np.corrcoef(MathDigitize,StatDigitize)

array([[ 1.        ,  0.35634832],
       [ 0.35634832,  1.        ]])

From this result (about 0.36), it seems we can say that there is a weak association between the Math and Stat variables.
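
For a 2x2 table the phi coefficient can also be computed directly from the four cell counts. With a, b, c, d denoting the counts of (like, like), (like, dislike), (dislike, like), and (dislike, dislike), a standard formula is phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The sketch below, in which the names a, b, c, d, and phi are introduced for illustration, applies it to the binary arrays printed above and compares the result with np.corrcoef.

```python
import numpy as np

MathDigitize = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
StatDigitize = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Cell counts of the 2x2 table (rows: Math, columns: Stat)
a = int(np.sum((MathDigitize == 1) & (StatDigitize == 1)))  # likes both
b = int(np.sum((MathDigitize == 1) & (StatDigitize == 0)))  # likes Math, dislikes Stat
c = int(np.sum((MathDigitize == 0) & (StatDigitize == 1)))  # dislikes Math, likes Stat
d = int(np.sum((MathDigitize == 0) & (StatDigitize == 0)))  # dislikes both

phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(a, b, c, d)
print(phi, np.corrcoef(MathDigitize, StatDigitize)[0, 1])
```

The four counts are the inner cells of the cross tabulation, and the two printed coefficients agree.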

Summary of functions

Note: numpy is abbreviated as np, matplotlib.pyplot as plt, pandas as pd, and scipy.stats as st.

Purpose	Function and module	Usage
Draw a scatter plot	plt.scatter(data1, data2)	plt.scatter([17, 13, 14, 7], [12, 10, 6, 8])
Draw a scatter plot (alternative)	plt.plot(data1, data2, 'o'): other markers such as '+' or '*' can be given as the third argument	plt.plot([17, 13, 14, 7], [12, 10, 6, 8], 'o')
Compute the variance-covariance matrix	np.cov(data1, data2)	np.cov([17, 13, 14, 7], [12, 10, 6, 8])
Compute the correlation coefficient	np.corrcoef(data1, data2)	np.corrcoef([17, 13, 14, 7], [12, 10, 6, 8])
Perform simple regression analysis	np.polyfit(data1, data2, 1)	np.polyfit([17, 13, 14, 7], [12, 10, 6, 8], 1)
Make a cross tabulation	pd.crosstab(array1, array2, margins=True)	pd.crosstab(np.array([0,1,0,0,1]), np.array([1,1,1,0,0]),margins=True)

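Exercises

Exercise 2-1

For data on the daily study time of 10 university students (StudyHours, in hours) and their term examination scores (ExamResult, out of 100), draw a scatter plot and compute the correlation coefficient.

Exercise 2-2

The following are the results of a questionnaire that asked 20 people about their food preference (Western food, 洋食, or Japanese food, 和食) and their taste preference (sweet, 甘党, or spicy, 辛党). Make a cross tabulation and compute the phi coefficient. (Note: the ID-number data are not used.)

# Data for Exercise 2-2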
import pandas as pd
df = pd.DataFrame({'Food':["洋食","和食","和食","洋食","和食","洋食","洋食","和食","洋食","洋食","和食","洋食",
                         "和食","洋食","和食","和食","洋食","洋食","和食","和食"],
                   'Taste':["甘党","辛党","甘党","甘党","辛党","辛党","辛党","辛党","甘党","甘党","甘党","甘党",
                          "辛党","辛党","甘党","辛党","辛党","甘党","辛党","辛党"]})

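Exercise 2-3

- Explain the difference between correlation and association.
- When examining the correlation between two variables, when should a scatter plot be used and when should a correlation coefficient be used? Give your view.
- The correlation coefficient of two variables was computed as 0.9. Can we conclude from this that the two variables are strongly correlated? Give your view. (Hint: might the number of data points matter?)
- Draw a scatter plot of the following variables x and y, and add the straight line obtained by simple regression analysis. Also give the correlation coefficient.

# Exercise 2-3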
import numpy as np
x = np.array([69, 70, 76, 69, 68, 74, 63, 79, 82, 74, 73, 66, 69, 71, 63, 73, 69, 63, 57, 71,
              77, 74, 66, 73, 63, 75, 68, 66, 69, 77])
y = np.array([71, 75, 73, 59, 72, 53, 55, 72, 70, 65, 76, 63, 58, 52, 63, 57, 62, 59, 47, 51,
              74, 52, 56, 61, 55, 70, 62, 66, 61, 63])

Correlation coefficient (r)	Interpretation
$-0.2 \leq r \leq 0.2$	Almost no correlation
$-0.4 \leq r < -0.2$ and $0.2 < r \leq 0.4$	Weak correlation
$-0.7 \leq r < -0.4$ and $0.4 < r \leq 0.7$	Moderate correlation
$-1.0 \leq r < -0.7$ and $0.7 < r \leq 1.0$	Strong correlation

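Output of pd.crosstab(data.Math, data.Stat, margins=True) (好き = like, 嫌い = dislike; rows are Math, columns are Stat):

Stat	好き	嫌い	All
Math
好き	4	2	6
嫌い	4	10	14
All	8	12	20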