
Chapter 1. Basics of Statistics

Recommended Post : 【Statistics】 Statistics Overview 


1. Introduction

2. Basic statistical terminology

3. Definition of data

4. Visualization of data


a. Quantile-Quantile Plot 



1. Introduction 

⑴ probability and statistics

① probability: the mathematical and theoretical study of chance

② statistics: the study of collecting, analyzing, interpreting, and presenting data

○ probability and statistics are related but clearly distinct

③ descriptive statistics: statistical techniques that summarize and describe data

④ inferential statistics: statistical techniques that go beyond summarizing numbers and calculate the probability that a value occurs

⑵ the meaning of probability

① Frequentist

○ claims that probability is an intrinsic attribute of an object

○ example: a coin is an object that lands heads with probability 1/2 and tails with probability 1/2

② Bayesian

○ claims that probability is nothing but human belief

○ example: counting how often a coin lands heads or tails does not actually prove that each probability is one-half
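
The two readings of the coin example can be sketched in a few lines. This is only an illustration under stated assumptions: the simulated fair coin, the fixed seed, and the uniform Beta(1, 1) prior are choices made here, not part of the original notes. The first estimate is the observed relative frequency of heads; the second is the mean of a belief about P(heads) after it has been updated with the same flips.

```python
import random


def simulate_coin(n_flips, p_heads=0.5, seed=0):
    """Return the number of heads in n_flips simulated flips (assumed coin)."""
    rng = random.Random(seed)
    return sum(rng.random() < p_heads for _ in range(n_flips))


def beta_posterior_mean(heads, tails, prior_a=1.0, prior_b=1.0):
    """Mean of the Beta posterior for P(heads); Beta(1, 1) prior is an assumption."""
    return (prior_a + heads) / (prior_a + prior_b + heads + tails)


n = 1000
heads = simulate_coin(n)

# Frequency-based estimate: observed share of heads
relative_frequency = heads / n

# Belief-based estimate: prior belief updated by the observed flips
posterior_mean = beta_posterior_mean(heads, n - heads)

print(relative_frequency, posterior_mean)
```

With no data at all, `beta_posterior_mean(0, 0)` simply returns the prior belief of 0.5, and with many flips the two estimates end up close to each other.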

⑶ trends in statistics

① major issues in classical statistics: finding distributions, increasing statistical power

② major issues in modern statistics: big data, machine learning



2. Basic statistical terminology 

⑴ average (mean)

⑵ most frequent number (mode)

⑶ central value (median)

① the middle value when the data are sorted: in a probability distribution, the areas on either side of the median are equal

② less sensitive than the mean to changes in the distribution


Figure 1. mean (left) and median (right)


○ changing only the values to the right of the median does not change the median

○ thus, the median is less sensitive than the mean

○ surprisingly, many people are not aware of this
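
A small numeric check makes the point concrete. The sample values below are made up for illustration; only the largest value, which lies to the right of the median, is changed.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 5]
shifted = [1, 2, 3, 4, 500]  # only a value to the right of the median changed

# The mean moves with the changed value; the median stays where it was
print(mean(values), median(values))
print(mean(shifted), median(shifted))
```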



3. Definition of data

⑴ data, information, and knowledge

① data: the raw values as given

② information: the names (labels) given to the data

③ knowledge: the relationships among pieces of information

⑵ types of datasets

① relation

② tree

③ network

⑶ type of attribute

class 1. continuous type data: quantitative data 

1-1. ratio scale: the highest-level scale

○ absolute zero + equal intervals + rank + category

○ ratios between values are meaningful

○ an absolute zero exists: negative values have no meaning

○ example: absolute temperature

1-2. interval scale: the second-highest-level scale

○ equal intervals + rank + category

○ ratios between values are not meaningful

○ no absolute zero: negative values are possible

○ example: Celsius temperature, Fahrenheit temperature

class 2. categorical data: qualitative data

2-1. ordinal scale

○ rank + category

○ the intervals between ranks cannot be assumed to be equal: quantification and averaging are not possible

○ example: 2 students in 3rd grade and 2 students in 1st grade cannot be treated as students who are in 2nd grade on average

2-2. nominal scale (categorical scale)

○ category only: each value serves as a name (label) for an item

○ example: gender, blood type
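
The difference between the scales can be checked with a short sketch. The temperatures and blood types below are made-up illustration values. Differences agree between Celsius and Kelvin, but a ratio is only meaningful on the scale with an absolute zero; for nominal data, the mode is the only one of the three summary values from section 2 that applies.

```python
from statistics import mode


def celsius_to_kelvin(celsius):
    """Convert an interval-scale value (Celsius) to a ratio-scale value (Kelvin)."""
    return celsius + 273.15


c1, c2 = 10.0, 20.0

# Interval scale: differences are meaningful, ratios are not
diff_celsius = c2 - c1
ratio_celsius = c2 / c1  # 2.0, but 20 °C is not "twice as hot" as 10 °C

# Ratio scale: both differences and ratios are meaningful
diff_kelvin = celsius_to_kelvin(c2) - celsius_to_kelvin(c1)
ratio_kelvin = celsius_to_kelvin(c2) / celsius_to_kelvin(c1)

# Nominal scale: values are only names, so counting (the mode) is what remains
blood_types = ["A", "O", "A", "B"]
most_common = mode(blood_types)

print(diff_celsius, diff_kelvin, ratio_celsius, ratio_kelvin, most_common)
```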

⑷ attribute semantics

① spatial : quantitative 

② temporal : quantitative

③ sequential : ordinal

④ diverging : quantitative

⑤ cyclic : categorical, ordinal, quantitative

⑥ hierarchical: categorical 



4. Visualization of data

⑴ classification of analysis

① frequency analysis: an analysis that identifies the characteristics of the distribution of one categorical variable

② cross analysis: an analysis that identifies the characteristics of the joint distribution of two or more categorical variables. It can be used to analyze independence and association

⑵ effective visual encodings by data type: within each list, encodings nearer the top are more effective
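
As a minimal sketch of the two analyses, the snippet below counts one categorical variable (frequency analysis) and then the combinations of two (cross analysis). The records are hypothetical, and the use of Pearson's chi-square statistic as the measure of independence is an assumption made here; the notes do not name a specific test.

```python
from collections import Counter

# Hypothetical records: (gender, blood type)
records = [
    ("F", "A"), ("F", "O"), ("M", "A"), ("M", "B"),
    ("F", "A"), ("M", "O"), ("M", "A"), ("F", "B"),
]

# Frequency analysis: distribution of a single categorical variable
blood_freq = Counter(blood for _, blood in records)

# Cross analysis: joint counts of two categorical variables
cross = Counter(records)


def chi_square(cross_counts, rows):
    """Pearson chi-square statistic comparing observed and expected joint counts."""
    n = len(rows)
    row_totals = Counter(g for g, _ in rows)
    col_totals = Counter(b for _, b in rows)
    stat = 0.0
    for g in row_totals:
        for b in col_totals:
            expected = row_totals[g] * col_totals[b] / n  # count under independence
            observed = cross_counts[(g, b)]
            stat += (observed - expected) ** 2 / expected
    return stat


print(blood_freq)
print(cross)
print(chi_square(cross, records))
```

In this made-up data both genders have the same blood-type distribution, so the observed counts equal the counts expected under independence and the statistic is zero.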

① quantitative variable

○ position

○ length 

○ angle

○ slope

○ area

○ volume

○ density

○ color saturation 

○ color hue: Considerations for Color Design for Color Blindness (ref)

○ texture

○ connection

○ containment

○ shape 

② ordinal variable 

○ position 

○ density

○ color saturation 

○ color hue 

○ texture 

○ connection

○ containment 

○ length 

○ angle 

○ slope 

○ area 

○ volume 

○ shape

③ nominal variable

○ position 

○ color hue 

○ texture 

○ connection 

○ containment 

○ density 

○ color saturation 

○ shape 

○ length 

○ angle 

○ slope 

○ area 

○ volume 

class 1. representation of two-dimensional information

① bar chart: categorical / ordinal (1D) + quantitative (1D)

○ when the categorical / ordinal variable is on the x-axis: long labeling is possible

○ when the categorical / ordinal variable is on the y-axis: it is possible to increase the number of variables

② line chart: ordinal / quantitative (1D) + quantitative (1D)

③ scatter plot: quantitative (1D) + quantitative (1D)

④ slope chart: quantitative (1D) + quantitative (1D). an alternative for scatter plot

⑤ histogram 

⑥ pie chart 

⑦ box plot 

⑧ stem-and-leaf plot

class 2. representation of three dimensional information

① matrix: categorical / ordinal (1D) + categorical / ordinal (1D) + quantitative (1D, color) (+ quantitative (1D, point size))

② extended barchart: stacked bar chart, grouped bar chart, etc 

③ extended line chart: area chart (≒ stacked line chart), etc

④ extended scatter plot: bubble chart (cf. point size also can be a variable), etc

⑤ symbol map: spatial (2D) + quantitative (1D)

class 3. representation of multidimensional information

① faceting: expressing two or three dimensional information for each parameter. multiple types of graphs are produced

② Chernoff face 

③ star plot: also called a spider plot, radar chart, cobweb chart, or polar chart

1-1. bar chart (bar graph)

① definition: a graph of nominal scale data

② generally, there is a gap between bars

③ R programming 

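# barplot() draws bars; plot() with two numeric vectors would draw a scatter plot instead
barplot(c(4, 5, 6), names.arg = c("1", "2", "3"), main = "BASIC PLOT")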


④ Python programming: Bokeh is used for web-page visualization


Figure 2. bar chart rendered with Bokeh


from bokeh.plotting import figure, output_file, show

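# Write the chart to a standalone HTML file
output_file("vbar.html")
graph = figure(width = 400, height = 400, title = "Bokeh Vertical Bar Graph", 
               tooltips=[("x", "$x"), ("y", "$y")] )
x = [1, 2, 3, 4, 5]      # bar positions on the x-axis
top = [1, 2, 3, 4, 5]    # bar heights
width = 0.5              # bar width; a value below 1 leaves a gap between bars
graph.vbar(x = x, top = top, width = width)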
show(graph)


1-2. line chart

① Python programming: Bokeh is used for web-page visualization


Figure 3. line chart rendered with Bokeh


from bokeh.plotting import figure, output_file, show

output_file("line_chart.html")
p = figure(width=400, height=400, title = "Line Chart", 
           tooltips=[("x", "$x"), ("y", "$y")])
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)


1-3. scatter plot

① definition

② scatter plot with marginal histogram 

③ Python programming: Bokeh is used for web-page visualization


Figure 4. scatter plot rendered with Bokeh


from bokeh.plotting import figure, output_file, show

output_file("scatter_plot.html")
p = figure(width=400, height=400, title = "Scatter Plot",
           tooltips=[("x", "$x"), ("y", "$y")])
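# scatter() draws circle markers by default; circle(size=...) is deprecated in recent Bokeh releases
p.scatter([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=20, color="navy", alpha=0.5)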
show(p)


1-4. histogram

① definition: a graph that divides continuous data (ratio or interval scale) into intervals and shows the frequency of each interval

② generally no gap between bars

③ 3D Histogram

④ R programming

hist(c(1, 2, 2, 3, 3, 3), col = "light yellow")
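
The interval step in the definition can be sketched by hand in Python. The bin edges (0.5 to 3.5, three equal-width bins) are an assumption chosen for this illustration and are not necessarily the breaks that R's `hist()` would pick by default.

```python
def histogram_counts(data, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        if low <= x <= high:
            # Values exactly at the upper edge go into the last bin
            index = min(int((x - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


data = [1, 2, 2, 3, 3, 3]
counts = histogram_counts(data, low=0.5, high=3.5, n_bins=3)
print(counts)
```

Each count becomes the height of one bar, and because the intervals are contiguous the bars touch.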


1-5. circle graph (pie chart)

① definition: a circular graph that shows continuous or discrete ratio-scale data as percentages of the whole

② R programming

pie(c(1, 2, 2, 3, 3, 3), labels = c("a", "b", "c", "d", "e", "f"), main = "PIE CHART")
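
The conversion to percentages that a pie chart relies on can be written out directly. This is a plain-Python sketch using the same values as the R call; the variable names are illustrative only.

```python
counts = [1, 2, 2, 3, 3, 3]
labels = ["a", "b", "c", "d", "e", "f"]

total = sum(counts)
percentages = [100 * c / total for c in counts]  # share of the whole, in percent
angles = [360 * c / total for c in counts]       # slice angle, in degrees

for label, pct, angle in zip(labels, percentages, angles):
    print(f"{label}: {pct:.1f}% ({angle:.1f} degrees)")
```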


1-6. box plot (box-and-whisker plot)

① quantile

○ quantile function: the inverse function of the cumulative distribution function

○ domain: {x | 0 ≤ x ≤ 1}

○ range: values of the statistic for the group of interest

○ depending on the number of sections, there are percentiles, quartiles, etc.

② from the bottom, the lower bound, 1st quartile, median, 3rd quartile, and upper bound are shown

○ the mean may or may not be marked

③ R programming

boxplot(c(1, 2, 2, 3, 3, 3))


④ example


Figure 5. an example of a box plot: the answer is ①
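
The quantile function and the five values drawn in a box plot can be sketched with Python's standard library. Two assumptions are made here: the standard normal distribution is used only as an example of a cumulative distribution function, and the minimum and maximum stand in for the lower and upper bounds (plotting tools may place the whiskers differently, for example by excluding outliers).

```python
from statistics import NormalDist, quantiles

# Quantile function = inverse of the cumulative distribution function
std_normal = NormalDist(mu=0.0, sigma=1.0)
p = 0.975
x = std_normal.inv_cdf(p)       # value below which a share p of the distribution lies
round_trip = std_normal.cdf(x)  # applying the CDF returns the original p

# Quartiles split sorted data into four sections
data = [1, 2, 2, 3, 3, 3]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Five values drawn in a box plot, from bottom to top
five_number = (min(data), q1, q2, q3, max(data))
print(x, round_trip, five_number)
```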


1-7. stem and leaf


Figure 6. stem-and-leaf plot


① stems indicate the tens place; leaves indicate the ones place

② the numbers in parentheses refer to the frequency of each stem
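
A stem-and-leaf display following these two rules can be built as below. The data values are made up, and the sketch assumes non-negative integers below 100 so that each value splits cleanly into a tens digit and a ones digit.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by tens digit (stem) and list their ones digits (leaves)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


data = [12, 15, 21, 23, 23, 30, 38, 41]
table = stem_and_leaf(data)

for stem in sorted(table):
    leaves = "".join(str(leaf) for leaf in table[stem])
    # The number in parentheses is the frequency of the stem
    print(f"{stem} | {leaves} ({len(table[stem])})")
```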

1-8. quantile-quantile plot 

2-1. stacked bar chart

① Python programming: Bokeh is used for web-page visualization


Figure 7. stacked bar chart rendered with Bokeh


# ColumnDataSource is used below, so it must be imported
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, show

output_file("hbar_stack.html")
p = figure(width=400, height=400, title = "Horizontal Stacked Bar Chart",
           tooltips=[("x", "$x"), ("y", "$y")])
source = ColumnDataSource(data=dict(
    y=[1, 2, 3, 4, 5],
    x1=[1, 2, 4, 3, 4],
    x2=[1, 4, 2, 2, 3],
))
p.hbar_stack(['x1', 'x2'], y='y', height=0.8, color=("grey", "lightgrey"), source=source)
show(p)


2-2. area chart

① Python programming: Bokeh is used for web-page visualization


Figure 8. area chart rendered with Bokeh


from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, show

output_file("varea_stack.html")

source = ColumnDataSource(data=dict(
    x=[1, 2, 3, 4, 5],
    y1=[1, 2, 4, 3, 4],
    y2=[1, 4, 2, 2, 3],
))

p = figure(width=400, height=400, title = "Area Chart",
           tooltips=[("x", "$x"), ("y", "$y")])
p.varea_stack(['y1', 'y2'], x='x', color=("grey", "lightgrey"), source=source)
show(p)


2-3. symbol map

① Python programming: Bokeh is used for web-page visualization


Figure 9. symbol map rendered with Bokeh


import numpy as np

from bokeh.io import output_file, show
from bokeh.plotting import figure
from bokeh.transform import linear_cmap
from bokeh.util.hex import hexbin

output_file("hex_tile.html")

n = 50000
x = np.random.standard_normal(n)
y = np.random.standard_normal(n)
bins = hexbin(x, y, 0.1)

p = figure(width=400, height=400, title = "Symbol Map", 
           match_aspect=True, background_fill_color='#440154',
           tooltips=[("x", "$x"), ("y", "$y")])
p.grid.visible = False

p.hex_tile(q="q", r="r", size=0.1, line_color=None, source=bins,
           fill_color=linear_cmap('counts', 'Viridis256', 0, max(bins.counts)))
show(p)



Input : 2019.09.11 15:15

Revised : 2022.03.13 18:21
