Master the vocabulary of data science. Each term explained with intuition, math, and practical context.
Your data's personality doesn't change over time. The mean, variance, and autocorrelation stay consistent whether you look at January or July.
library(tseries)
# Run ADF test for stationarity
adf.test(my_data$Close)
# If p-value > 0.05, data is non-stationary
# Apply differencing to make it stationary
stationary_data <- diff(my_data$Close)

Most time series models assume stationarity. If your data has trends or changing variance, predictions will be unreliable. You need to difference or transform non-stationary data first.
[Figure: Stationary vs. non-stationary series]
The general direction your data is heading over time. Is it going up, down, or staying flat?
library(stats)
# Decompose time series into components
decomposed <- decompose(ts(my_data$Close, frequency = 12))
# Extract the trend component
trend <- decomposed$trend
plot(trend, main = "Trend Component")

Identifying trends helps you understand long-term patterns. Removing trends (via differencing) is often the first step to making data stationary.
Repeating patterns at regular intervals. Like how ice cream sales spike every summer or how traffic increases every Monday morning.
# Create seasonal time series (monthly data)
ts_data <- ts(my_data$Close, frequency = 12, start = c(2024, 1))
# Decompose to extract seasonal component
decomposed <- decompose(ts_data)
seasonal <- decomposed$seasonal
# Plot seasonal pattern
plot(seasonal, main = "Seasonal Component")

Ignoring seasonality leads to poor forecasts. Models like SARIMA explicitly handle seasonal patterns to improve accuracy.
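Since the note above mentions SARIMA, here is a minimal sketch of fitting one with the forecast package, assuming the monthly ts_data built above; the (1, 1, 1) orders are illustrative, not a recommendation.
library(forecast)
# Seasonal ARIMA: non-seasonal (p, d, q) plus seasonal (P, D, Q) terms
sarima_model <- Arima(ts_data, order = c(1, 1, 1), seasonal = c(1, 1, 1))
summary(sarima_model)
# Or let auto.arima() pick the seasonal structure from the ts frequency
auto_seasonal <- auto.arima(ts_data)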
How much today's value is related to values from previous days. It's like asking 'does yesterday predict today?'
# Plot Autocorrelation Function
acf(my_data$Close, main = "ACF Plot")
# Get numeric ACF values (element 1 is lag 0, which is always 1)
acf_values <- acf(my_data$Close, plot = FALSE)
print(acf_values$acf[1:10]) # lags 0 through 9
# Significant spikes beyond the blue dashed lines = useful lags

ACF plots help you identify the 'q' parameter in ARIMA models. Significant spikes at certain lags reveal the structure of your data.
[Figure: Sample ACF plot]
Instead of looking at actual values, you look at the change between consecutive values. It's like checking your daily weight change instead of total weight.
# First-order differencing (d = 1)
diff_1 <- diff(my_data$Close, differences = 1)
# Second-order differencing (d = 2)
diff_2 <- diff(my_data$Close, differences = 2)
# Plot original vs differenced
par(mfrow = c(2, 1))
plot(my_data$Close, type = "l", main = "Original")
plot(diff_1, type = "l", main = "First Difference")

Differencing removes trends and makes data stationary. The 'd' in ARIMA represents how many times you difference your data.
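One quick sanity check, as a sketch reusing the tseries package from earlier: re-run the ADF test on the differenced series to confirm that differencing worked.
library(tseries)
# Re-test the differenced series for stationarity
adf.test(diff_1)
# A small p-value here suggests the trend is gone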
A time shift. Lag-1 means 'one time period ago'. If you're looking at daily data, lag-7 means 'a week ago'.
# Create lagged versions of your data
library(dplyr)
df <- data.frame(
value = my_data$Close,
lag_1 = lag(my_data$Close, 1), # Yesterday
lag_7 = lag(my_data$Close, 7) # One week ago
)
# Lag plot to visualize autocorrelation
lag.plot(my_data$Close, lags = 4)

Lags are the building blocks of time series analysis. AR models use lagged values as predictors, and lag plots help visualize dependencies.
Predicting today based on yesterday (and maybe the day before). It's regression, but using your own past values instead of other variables.
library(forecast)
# Fit AR(2) model - predicting from 2 past values
ar_model <- Arima(my_data$Close, order = c(2, 0, 0))
# View AR coefficients (phi values)
print(ar_model$coef)
# The "p" in ARIMA(p, d, q) is the AR orderThe 'p' in ARIMA. AR components capture how past values influence the present. PACF helps you determine how many AR terms to include.
Predicting today based on past prediction errors. If you overshot yesterday, adjust today accordingly.
library(forecast)
# Fit MA(2) model - using 2 past errors
ma_model <- Arima(my_data$Close, order = c(0, 0, 2))
# View MA coefficients (theta values)
print(ma_model$coef)
# The "q" in ARIMA(p, d, q) is the MA orderThe 'q' in ARIMA. MA components smooth out noise and capture short-term dependencies. ACF helps you determine how many MA terms to include.
The Swiss Army knife of time series. Combines AutoRegressive (past values), Integrated (differencing), and Moving Average (past errors).
library(forecast)
# Fit ARIMA(1, 1, 1) model
arima_model <- Arima(my_data$Close, order = c(1, 1, 1))
# Or let R choose the best parameters
auto_model <- auto.arima(my_data$Close)
# Forecast next 30 periods
forecast_result <- forecast(auto_model, h = 30)
plot(forecast_result)

ARIMA is the go-to model for univariate time series forecasting. Master ARIMA and you can tackle most forecasting problems.
[Figure: ARIMA(p, d, q) components]
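One habit worth pairing with any ARIMA fit: check that the residuals look like white noise. A minimal sketch using the forecast package's checkresiduals() on auto_model from above:
library(forecast)
# Ljung-Box test plus a residual ACF plot in one call
checkresiduals(auto_model)
# A large Ljung-Box p-value suggests the residuals are uncorrelated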
A statistical test that asks: 'Is this data stationary?' It gives you a p-value to make the call.
library(tseries)
# Run Augmented Dickey-Fuller test
adf_result <- adf.test(my_data$Close)
# Check the p-value
print(adf_result$p.value)
# If p-value > 0.05: Non-stationary, increase d!
# If p-value < 0.05: Stationary, good to go!

Before fitting ARIMA, you need to know if differencing is required. An ADF p-value > 0.05 means you should difference the data (increase 'd').
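If you'd rather not eyeball p-values, here is a sketch of a common shortcut: ndiffs() from the forecast package estimates the number of differences needed via repeated unit-root tests.
library(forecast)
# Estimate how many differences are needed for stationarity
d <- ndiffs(my_data$Close)
print(d) # a reasonable starting value for "d" in ARIMA(p, d, q)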
Like ACF, but it removes the influence of intermediate lags. It shows the direct relationship between today and k days ago.
# Plot Partial Autocorrelation Function
pacf(my_data$Close, main = "PACF Plot")
# Get numeric PACF values
pacf_values <- pacf(my_data$Close, plot = FALSE)
print(pacf_values$acf[1:10])
# For an AR(p) process, the PACF cuts off after lag p
# This tells you the AR order (p) for ARIMA

PACF is your guide for choosing 'p' (AR order). Where the PACF cuts off sharply is often the right 'p' value.
[Figure: Sample PACF plot, cutting off at lag 2]
Using your model to predict future values. The further out you go, the less confident you should be.
library(forecast)
# Fit model and forecast 30 periods ahead
model <- auto.arima(my_data$Close)
fc <- forecast(model, h = 30)
# Plot with prediction intervals
plot(fc, main = "30-Period Forecast")
# Access point forecasts and intervals
print(fc$mean) # Point forecasts
print(fc$lower) # Lower bounds (80% & 95%)
print(fc$upper) # Upper bounds (80% & 95%)

The whole point! Good forecasts drive business decisions. Always report prediction intervals to show forecast uncertainty.
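To know whether those forecasts are any good, a common check is a hold-out split. A sketch, assuming my_data has comfortably more than 30 rows:
library(forecast)
n <- length(my_data$Close)
train <- my_data$Close[1:(n - 30)] # everything except the last 30 points
test <- my_data$Close[(n - 29):n]  # the last 30 points, held out
holdout_model <- auto.arima(train)
holdout_fc <- forecast(holdout_model, h = 30)
accuracy(holdout_fc, test) # RMSE, MAE, etc. on the held-out data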
Comparing group averages to see if at least one is different. Like asking 'Do these treatments actually have different effects?'
# Create sample data with 3 groups
group_A <- c(45, 48, 42, 47, 44)
group_B <- c(62, 65, 58, 61, 63)
group_C <- c(38, 35, 40, 37, 39)
# Run one-way ANOVA
data <- data.frame(
value = c(group_A, group_B, group_C),
group = factor(rep(c("A", "B", "C"), each = 5))
)
anova_result <- aov(value ~ group, data = data)
summary(anova_result)

When you have 3+ groups to compare, ANOVA tells you if the differences are real or just random noise. It's the gateway to understanding experimental results.
[Figure: Comparing group means]
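ANOVA only says that at least one group differs, not which one. A sketch of the standard follow-up, Tukey's HSD, applied to the aov fit above:
# Pairwise comparisons with family-wise error control
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
plot(tukey_result) # confidence intervals for each pairwise difference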
The probability of seeing your data (or more extreme) if the null hypothesis were true. Small p-value = 'something interesting is happening'.
# T-test example
group1 <- c(23, 25, 28, 22, 24)
group2 <- c(30, 32, 29, 31, 33)
result <- t.test(group1, group2)
# Extract p-value
print(result$p.value)
# Common thresholds:
# p < 0.05 -> Statistically significant
# p < 0.01 -> Highly significant
# p < 0.001 -> Very highly significant

P-values guide decision-making. Below 0.05 typically means 'statistically significant', but context matters more than arbitrary thresholds.
A range of plausible values for your estimate. '95% confident the true value is between X and Y'.
# Sample data
data <- c(23, 25, 28, 22, 24, 26, 27, 25, 24, 26)
# Calculate 95% confidence interval for the mean
result <- t.test(data, conf.level = 0.95)
print(result$conf.int)
# Or manually (1.96 is the large-sample normal approximation;
# for n = 10, the t quantile qt(0.975, df = 9), about 2.26, is more accurate):
mean_val <- mean(data)
se <- sd(data) / sqrt(length(data))
ci_lower <- mean_val - 1.96 * se
ci_upper <- mean_val + 1.96 * se

Point estimates are incomplete. Confidence intervals communicate uncertainty and help you understand how precise your estimates really are.
How spread out your data is. High variance = wild swings. Low variance = stable and predictable.
# Sample data
data <- c(23, 25, 28, 22, 24, 26, 27, 25, 24, 26)
# Variance (sample variance with n-1)
sample_var <- var(data)
print(sample_var)
# Standard deviation (sqrt of variance)
std_dev <- sd(data)
print(std_dev)
# Population variance (n instead of n-1)
pop_var <- var(data) * (length(data) - 1) / length(data)

Variance appears everywhere in statistics. It's in confidence intervals, hypothesis tests, and model assumptions. Understanding spread is fundamental.
The classic average. Add everything up and divide by how many you have.
# Sample data
data <- c(23, 25, 28, 22, 24, 150) # Note the outlier!
# Calculate mean
mean_val <- mean(data)
print(mean_val) # Skewed by outlier
# Compare with median (robust to outliers)
median_val <- median(data)
print(median_val)
# Trimmed mean (drops the top and bottom 10% before averaging)
trimmed_mean <- mean(data, trim = 0.1)

The most common measure of central tendency. But beware: outliers can skew the mean dramatically. Consider the median for robust analysis.
Drawing the best-fit line through your data. Predicting Y using X.
# Simple linear regression
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 4.3, 5.8, 8.1, 9.9, 12.2, 13.8, 16.1, 18.0, 20.2)
# Fit the model: Y = b0 + b1*X
model <- lm(y ~ x)
summary(model)
# Extract coefficients
intercept <- coef(model)[1] # b0
slope <- coef(model)[2] # b1
# Predict new values
predict(model, newdata = data.frame(x = c(11, 12)))

Regression is the workhorse of predictive modeling. Time series models are essentially regression with past values as predictors.
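To make that last sentence concrete, here is a sketch that fits an AR(1) model as a plain lm() by regressing the series on its own lag-1 values, reusing the y vector above:
n <- length(y)
ar1_df <- data.frame(
  today = y[2:n],           # current value
  yesterday = y[1:(n - 1)]  # lag-1 value
)
ar_as_lm <- lm(today ~ yesterday, data = ar1_df)
summary(ar_as_lm) # the "yesterday" coefficient plays the role of phi in AR(1)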