Chapter 1. Basics of Statistics
Recommended Post : 【Statistics】 Statistics Overview
1. Introduction
2. Basic statistical terminology
3. Definition of data
4. Visualization of data
1. Introduction
⑴ probability and statistics
① probability: the mathematical, theoretical study of how likely events are
② statistics: the study of collecting, analyzing, interpreting, and presenting data
○ probability and statistics are related but clearly different: probability reasons from a model to the data it could produce, while statistics reasons from observed data back to the model
③ descriptive statistics: statistical techniques that summarize and describe data
④ inferential statistics: statistical techniques that go beyond summarizing the numbers and use probability to draw conclusions about the population behind the data
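To make the split between descriptive and inferential statistics concrete, here is a minimal Python sketch. The sample values are made up for illustration, and the interval uses the normal critical value (about 1.96), which is an assumption of this sketch; for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample (made-up numbers for illustration only).
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.0]

# Descriptive statistics: summarize the data that were actually observed.
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential statistics: make a probabilistic statement about the unseen population.
# Approximate 95% interval for the population mean, using the normal critical value.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * sd / math.sqrt(n)
ci = (mean - half_width, mean + half_width)

print(f"mean = {mean:.3f}, sd = {sd:.3f}")
print(f"approx. 95% interval for the population mean: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The first two numbers only describe the eight observations; the interval is a claim about data that were never collected, which is why it needs a probability model.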
⑵ the meaning of probability
① Frequentist
○ claims that probability is an intrinsic, objective attribute of an object or of a repeatable experiment
○ example: a coin is an object that lands heads with probability one-half and tails with probability one-half
② Bayesian
○ claims that probability is nothing but a degree of human belief
○ example: counting how often a coin lands heads or tails does not strictly prove that each probability is one-half; the counts only update our belief
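The two viewpoints can be compared on the same simulated coin. The sketch below is only an illustration: the true heads probability of 0.5, the fixed random seed, and the uniform Beta(1, 1) prior are all assumptions chosen for the example.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Simulate tosses of a coin whose true P(heads) is 0.5 (an assumption of this sketch).
n_tosses = 10_000
tosses = [random.random() < 0.5 for _ in range(n_tosses)]
heads = sum(tosses)
tails = n_tosses - heads

# Frequentist view: estimate the probability by the observed relative frequency.
freq_estimate = heads / n_tosses

# Bayesian view: start from a uniform Beta(1, 1) prior belief and update it.
# With a Beta prior and coin-toss data, the posterior is Beta(1 + heads, 1 + tails).
a, b = 1 + heads, 1 + tails
post_mean = a / (a + b)
post_var = (a * b) / ((a + b) ** 2 * (a + b + 1))

print(f"relative frequency of heads: {freq_estimate:.4f}")
print(f"posterior mean: {post_mean:.4f}, posterior sd: {post_var ** 0.5:.4f}")
```

With this much data the two numbers nearly coincide; the difference is in interpretation. The frequentist number estimates a fixed property of the coin, while the Bayesian result is a whole distribution describing how uncertain the belief still is.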
⑶ trends in statistics
① major issues in classical statistics: finding distributions, increasing statistical power
② major issues in modern statistics: big data, machine learning
2. Basic statistical terminology
⑴ average (mean)
⑵ most frequent value (mode)
⑶ central value (median)
① the middle value when the data are sorted: in a probability distribution, the areas (probabilities) on both sides of the median are equal
② less sensitive than the mean to changes in the distribution
○ a change that affects only the values to the right of the median (for example, the largest value becoming even larger) cannot change the median
○ thus, the median is less sensitive to extreme values than the mean
○ surprisingly, many people are not aware of this
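The robustness of the median can be checked with a few lines of Python. The numbers are hypothetical; the second list differs from the first only in its largest value.

```python
import statistics

values = [30, 32, 35, 38, 40]          # hypothetical data
with_outlier = [30, 32, 35, 38, 400]   # only the largest value has changed

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves from 35 to 107, while the median stays at 35, because the change happened entirely to the right of the middle value.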
3. Definition of data
⑴ data, information, and knowledge
① data: the given raw values
② information: the name (meaning) attached to the data
③ knowledge: the relationships among pieces of information
⑵ types of datasets
① relation
② tree
③ network
⑶ type of attribute
① class 1. continuous type data: quantitative data
○ 1-1. ratio scale: first-rank (highest-level) scale
○ absolute zero + same interval + rank + category
○ concept of ratio between scales can be established
○ an absolute zero exists: negative values have no meaning
○ example: absolute temperature
○ 1-2. interval scale: second-rank scale
○ same interval + rank + category
○ the concept of ratio between scales cannot be established
○ no absolute zero: negative values are meaningful
○ example : Celsius temperature, Fahrenheit temperature
② class 2. categorical data: qualitative data
○ 2-1. ordinal scale
○ rank + category
○ the intervals between ranks cannot be assumed to be equal: arithmetic such as averaging is not meaningful
○ example: 2 students in 3rd grade and 2 students in 1st grade cannot be treated as 4 students in 2nd grade on average
○ 2-2. nominal scale (categorical scale)
○ category only: the values are just names (labels) for each item
○ example: gender, blood type
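The practical consequence of the four scales is which operations make sense on the values. A minimal sketch, with made-up readings and labels:

```python
import statistics

# Interval scale: Celsius has no absolute zero, so a ratio of readings is not meaningful.
c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low          # 2.0, yet 20 °C is not "twice as hot" as 10 °C

# Ratio scale: Kelvin starts at absolute zero, so the ratio is meaningful.
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low           # roughly 1.035

# Ordinal scale: school grades (1st, 2nd, 3rd) have an order but no arithmetic.
# Two 1st-graders and two 3rd-graders: the arithmetic mean says "2nd grade",
# a grade nobody is in; order-based summaries stay within the observed values.
grades = [1, 1, 3, 3]
misleading_mean = statistics.mean(grades)
low_mid = statistics.median_low(grades)
high_mid = statistics.median_high(grades)

# Nominal scale: blood types are labels only, so counting (the mode) is what is left.
blood_types = ["A", "O", "A", "B", "AB", "A"]
most_common = statistics.mode(blood_types)

print(celsius_ratio, round(kelvin_ratio, 3))
print(misleading_mean, low_mid, high_mid, most_common)
```

Differences are meaningful on both temperature scales (the gap is 10 degrees either way); only the ratio depends on having an absolute zero.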
⑷ attribute semantics
① spatial : quantitative
② temporal : quantitative
③ sequential : ordinal
④ diverging : quantitative
⑤ cyclic : categorical, ordinal, quantitative
⑥ hierarchical: categorical
4. Visualization of data
⑴ classification of analysis
① frequency analysis: an analysis that identifies the characteristics of the distribution of a single categorical variable
② cross analysis: an analysis that identifies the characteristics of the joint distribution of two or more categorical variables; it can be used to examine independence and association between them
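Both analyses reduce to counting. The sketch below uses hypothetical survey records and the Pearson chi-square statistic as one common measure of departure from independence; the choice of that statistic is an assumption of this example, and with counts this small the usual chi-square approximation would not be reliable in practice.

```python
from collections import Counter

# Hypothetical survey records: (gender, blood type).
records = [
    ("F", "A"), ("F", "O"), ("F", "A"), ("F", "B"),
    ("M", "O"), ("M", "O"), ("M", "A"), ("M", "B"),
]
n = len(records)

# Frequency analysis: distribution of ONE categorical variable.
blood_counts = Counter(blood for _, blood in records)

# Cross analysis: joint distribution of TWO categorical variables.
cross = Counter(records)
gender_counts = Counter(gender for gender, _ in records)

# Pearson chi-square statistic for independence:
# sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n.
chi2 = 0.0
for gender, row_total in gender_counts.items():
    for blood, col_total in blood_counts.items():
        expected = row_total * col_total / n
        observed = cross[(gender, blood)]
        chi2 += (observed - expected) ** 2 / expected

print("frequency table:", dict(blood_counts))
print("cross table:", dict(cross))
print(f"chi-square statistic: {chi2:.4f}")
```

A statistic near zero means the observed table is close to what independence would predict; larger values indicate stronger association.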
⑵ effective forms of expression by data type: within each list below, the forms listed higher are more effective for that data type
① quantitative variable
○ position
○ length
○ angle
○ slope
○ area
○ volume
○ density
○ color saturation
○ color hue
○ texture
○ connection
○ containment
○ shape
② ordinal variable
○ position
○ density
○ color saturation
○ color hue
○ texture
○ connection
○ containment
○ length
○ angle
○ slope
○ area
○ volume
○ shape
③ nominal variable
○ position
○ color hue
○ texture
○ connection
○ containment
○ density
○ color saturation
○ shape
○ length
○ angle
○ slope
○ area
○ volume
⑶ class 1. representation of two-dimensional information
① bar chart: categorical / ordinal (1D) + quantitative (1D)
○ when the categorical / ordinal variable is on the x-axis: long labeling is possible
○ when the categorical / ordinal variable is on the y-axis: it is possible to increase the number of variables
② line chart: ordinal / quantitative (1D) + quantitative (1D)
③ scatter plot: quantitative (1D) + quantitative (1D)
④ slope chart: quantitative (1D) + quantitative (1D). an alternative to the scatter plot
⑤ histogram
⑥ pie chart
⑦ box plot
⑧ stem-and-leaf plot
⑷ class 2. representation of three-dimensional information
① matrix: categorical / ordinal (1D) + categorical / ordinal (1D) + quantitative (1D, color) (+ quantitative (1D, point size))
② extended bar chart: stacked bar chart, grouped bar chart, etc.
③ extended line chart: area chart (≒ stacked line chart), etc.
④ extended scatter plot: bubble chart (cf. point size can also be a variable), etc.
⑤ symbol map: spatial (2D) + quantitative (1D)
⑸ class 3. representation of multidimensional information
① faceting: drawing a two- or three-dimensional chart for each value of a parameter, so that multiple graphs are produced
② Chernoff face
③ star plot: also called a spider plot, radar chart, cobweb chart, or polar chart
⑹ 1-1. bar chart (bar graph)
① definition: a graph of nominal scale data
② generally, there is a gap between bars
③ R programming
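# barplot() draws bars; plot() with two numeric vectors would draw a scatter plot instead
barplot(c(4, 5, 6), names.arg = c("a", "b", "c"), main = "BASIC PLOT")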
④ Python programming: Bokeh is used for web-page visualization
from bokeh.plotting import figure, output_file, show

output_file("stacked_bar.html")

graph = figure(
    width=400,
    height=400,
    title="Bokeh Vertical Bar Graph",
    tooltips=[("x", "$x"), ("y", "$y")],
)

x = [1, 2, 3, 4, 5]    # bar centers
top = [1, 2, 3, 4, 5]  # bar heights
width = 0.5            # bar width

graph.vbar(x, top=top, width=width)
show(graph)
⑺ 1-2. line chart
① Python programming: Bokeh is used for web-page visualization
from bokeh.plotting import figure, output_file, show

output_file("line_chart.html")

p = figure(
    width=400,
    height=400,
    title="Line Chart",
    tooltips=[("x", "$x"), ("y", "$y")],
)
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)
⑻ 1-3. scatter plot
① definition
② scatter plot with marginal histogram
③ Python programming: Bokeh is used for web-page visualization
from bokeh.plotting import figure, output_file, show

output_file("scatter_plot.html")

p = figure(
    width=400,
    height=400,
    title="Scatter Plot",
    tooltips=[("x", "$x"), ("y", "$y")],
)
# scatter() draws circle markers by default; circle(size=...) is deprecated in recent Bokeh releases
p.scatter([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=20, color="navy", alpha=0.5)
show(p)
⑼ 1-4. histogram
① definition: a graph in which continuous data on a ratio or interval scale are divided into intervals (bins) and the frequency of each interval is drawn as a bar
② generally no gap between bars
③ 3D Histogram
④ R programming
hist(c(1, 2, 2, 3, 3, 3), col = "light yellow")
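The counting step behind a histogram can be sketched in plain Python. The bin start, width, and number of bins below are choices made for this illustration, not the defaults that R's hist() would pick.

```python
def histogram(values, start, width, n_bins):
    """Count how many values fall in each half-open bin [edge, edge + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:       # values outside the covered range are ignored
            counts[idx] += 1
    return counts


data = [1, 2, 2, 3, 3, 3]
start, width, n_bins = 0.5, 1.0, 3   # bins: [0.5, 1.5), [1.5, 2.5), [2.5, 3.5)

counts = histogram(data, start, width, n_bins)
for i, c in enumerate(counts):
    lo = start + i * width
    print(f"[{lo:.1f}, {lo + width:.1f}) {'#' * c}")
```

Because the bins tile a continuous axis without gaps, the bars of a histogram touch each other, unlike the bars of a bar chart.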
⑽ 1-5. circle graph (pie chart)
① definition: a circular graph in which continuous or discrete data on a ratio scale are expressed as percentages of the whole
② R programming
pie(c(1, 2, 2, 3, 3, 3), labels = c("a", "b", "c", "d", "e", "f"), main = "PIE CHART")
⑾ 1-6. box plot (box-and-whisker plot)
① quantile
○ quantile function: the inverse function of the cumulative distribution function
○ domain: {x | 0 ≤ x ≤ 1}
○ range: values of the statistic for the group of interest
○ depending on the number of equal parts, there are percentiles, quartiles, etc.
② from bottom to top, the lower bound, 1st quartile, median, 3rd quartile, and upper bound are represented
○ the mean may or may not be marked
③ R programming
boxplot(c(1, 2, 2, 3, 3, 3))
④ example
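The numbers a box plot draws can be computed directly. This is a sketch under two stated assumptions: quartiles are computed with the "inclusive" method of Python's statistics.quantiles (one of several conventions, so other tools may give slightly different values), and the whiskers follow the common 1.5 × IQR rule. The data are hypothetical.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]   # hypothetical sample

# Quartiles: cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers (1.5 * IQR convention, assumed here): the most extreme data points
# that still lie within 1.5 * IQR of the box; anything beyond is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The quantile function is the inverse of the CDF: applying one after the
# other returns the starting probability (shown for a standard normal).
nd = statistics.NormalDist()
round_trip = nd.cdf(nd.inv_cdf(0.9))

print("lower whisker, Q1, median, Q3, upper whisker:")
print(lower_whisker, q1, q2, q3, upper_whisker)
print("outliers:", outliers)
print("cdf(inv_cdf(0.9)) =", round(round_trip, 6))
```

Here the value 30 falls outside the upper fence, so a box plot following this convention would draw it as a separate point rather than extend the whisker to it.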
⑿ 1-7. stem and leaf
① stems indicate the tens place; leaves indicate the ones place
② the numbers in parentheses refer to the frequency of each stem
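A stem-and-leaf display is easy to build by hand, which the following sketch illustrates for non-negative integers. The function name stem_and_leaf and the scores are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by tens part (stem) and ones digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)


scores = [12, 15, 21, 23, 23, 28, 34, 35, 41]   # hypothetical data
table = stem_and_leaf(scores)

for stem in range(min(table), max(table) + 1):
    leaves = table.get(stem, [])
    # The number in parentheses is the frequency of that stem.
    print(f"{stem} | {' '.join(map(str, leaves))} ({len(leaves)})")
```

Unlike a histogram, the display keeps every original value recoverable: stem 2 with leaf 8 is the observation 28.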
⒀ 1-8. quantile-quantile plot
⒁ 2-1. stacked bar chart
① Python programming: Bokeh is used for web-page visualization
from bokeh.models import ColumnDataSource  # required: used below but missing from the original imports
from bokeh.plotting import figure, output_file, show

output_file("hbar_stack.html")

p = figure(
    width=400,
    height=400,
    title="Horizontal Stacked Bar Chart",
    tooltips=[("x", "$x"), ("y", "$y")],
)

source = ColumnDataSource(data=dict(
    y=[1, 2, 3, 4, 5],
    x1=[1, 2, 4, 3, 4],
    x2=[1, 4, 2, 2, 3],
))

# x1 and x2 are stacked left to right for each y position
p.hbar_stack(['x1', 'x2'], y='y', height=0.8, color=("grey", "lightgrey"), source=source)
show(p)
⒂ 2-2. area chart
① Python programming: Bokeh is used for web-page visualization
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, show

output_file("varea_stack.html")

source = ColumnDataSource(data=dict(
    x=[1, 2, 3, 4, 5],
    y1=[1, 2, 4, 3, 4],
    y2=[1, 4, 2, 2, 3],
))

p = figure(
    width=400,
    height=400,
    title="Area Chart",
    tooltips=[("x", "$x"), ("y", "$y")],
)

# y1 and y2 are stacked bottom to top along x
p.varea_stack(['y1', 'y2'], x='x', color=("grey", "lightgrey"), source=source)
show(p)
⒃ 2-3. symbol map
① Python programming: Bokeh is used for web-page visualization
import numpy as np
from bokeh.io import output_file, show
from bokeh.plotting import figure
from bokeh.transform import linear_cmap
from bokeh.util.hex import hexbin

output_file("hex_tile.html")

# 50,000 random points from a 2D standard normal distribution
n = 50000
x = np.random.standard_normal(n)
y = np.random.standard_normal(n)

# group the points into hexagonal bins of size 0.1
bins = hexbin(x, y, 0.1)

p = figure(
    width=400,
    height=400,
    title="Symbol Map",
    match_aspect=True,
    background_fill_color='#440154',
    tooltips=[("x", "$x"), ("y", "$y")],
)
p.grid.visible = False

# color each hexagon by the number of points it contains
p.hex_tile(q="q", r="r", size=0.1, line_color=None, source=bins,
           fill_color=linear_cmap('counts', 'Viridis256', 0, max(bins.counts)))
show(p)
Input : 2019.09.11 15:15
Revised : 2022.03.13 18:21