{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Основы анализа данных с pandas & matplotlib\n", "\n", "\n", "[Pandas](http://pandas.pydata.org/) - библиотека Python для анализа данных для эффективной работы с объектами таблиц (DataFrame) с индексацией;\n", "\n", "[Matplotlib](https://matplotlib.org/#) - библиотека для визуализации данных и построения графиков и диаграмм, во многом повторяющая возможности MATLAB." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "TRAIN_CSV_PATH = 'titanic/train.csv'\n", "TEST_CSV_PATH = 'titanic/test.csv'\n", "\n", "# загрузка обучающей выборки данных в объект data frame и просмотр верха таблицы\n", "# обратить внимание на колонку с номером строки и на PassengerId\n", "# load training data to pandas data frame and see its top rows\n", "# pay attention to index column and PassengerId column\n", "df_train = pd.read_csv(TRAIN_CSV_PATH)\n", "df_train.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"    <tr><th>PassengerId</th><th colspan=\"11\"></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>5</th><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>6</th><td>0</td><td>3</td><td>Moran, Mr. James</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>330877</td><td>8.4583</td><td>NaN</td><td>Q</td></tr>\n",
"    <tr><th>7</th><td>0</td><td>1</td><td>McCarthy, Mr. Timothy J</td><td>male</td><td>54.0</td><td>0</td><td>0</td><td>17463</td><td>51.8625</td><td>E46</td><td>S</td></tr>\n",
"    <tr><th>8</th><td>0</td><td>3</td><td>Palsson, Master. Gosta Leonard</td><td>male</td><td>2.0</td><td>3</td><td>1</td><td>349909</td><td>21.0750</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>9</th><td>1</td><td>3</td><td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td><td>female</td><td>27.0</td><td>0</td><td>2</td><td>347742</td><td>11.1333</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>10</th><td>1</td><td>2</td><td>Nasser, Mrs. Nicholas (Adele Achem)</td><td>female</td><td>14.0</td><td>1</td><td>0</td><td>237736</td><td>30.0708</td><td>NaN</td><td>C</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "6 0 3 \n", "7 0 1 \n", "8 0 3 \n", "9 1 3 \n", "10 1 2 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "6 Moran, Mr. James male NaN \n", "7 McCarthy, Mr. Timothy J male 54.0 \n", "8 Palsson, Master. Gosta Leonard male 2.0 \n", "9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 \n", "10 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S \n", "6 0 0 330877 8.4583 NaN Q \n", "7 0 0 17463 51.8625 E46 S \n", "8 3 1 349909 21.0750 NaN S \n", "9 0 2 347742 11.1333 NaN S \n", "10 1 0 237736 30.0708 NaN C " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# используем PassengerId как колонку с уникальным индексом (номером), чтобы избежать двойной индексации\n", "# we should use PassengerId as index column to not have duplicated indexes\n", "df_train = pd.read_csv(TRAIN_CSV_PATH, index_col=0)\n", "df_train.head(10)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"    <tr><th>PassengerId</th><th colspan=\"10\"></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>892</th><td>3</td><td>Kelly, Mr. James</td><td>male</td><td>34.5</td><td>0</td><td>0</td><td>330911</td><td>7.8292</td><td>NaN</td><td>Q</td></tr>\n",
"    <tr><th>893</th><td>3</td><td>Wilkes, Mrs. James (Ellen Needs)</td><td>female</td><td>47.0</td><td>1</td><td>0</td><td>363272</td><td>7.0000</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>894</th><td>2</td><td>Myles, Mr. Thomas Francis</td><td>male</td><td>62.0</td><td>0</td><td>0</td><td>240276</td><td>9.6875</td><td>NaN</td><td>Q</td></tr>\n",
"    <tr><th>895</th><td>3</td><td>Wirz, Mr. Albert</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>315154</td><td>8.6625</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>896</th><td>3</td><td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td><td>female</td><td>22.0</td><td>1</td><td>1</td><td>3101298</td><td>12.2875</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Pclass Name Sex \\\n", "PassengerId \n", "892 3 Kelly, Mr. James male \n", "893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "894 2 Myles, Mr. Thomas Francis male \n", "895 3 Wirz, Mr. Albert male \n", "896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "892 34.5 0 0 330911 7.8292 NaN Q \n", "893 47.0 1 0 363272 7.0000 NaN S \n", "894 62.0 0 0 240276 9.6875 NaN Q \n", "895 27.0 0 0 315154 8.6625 NaN S \n", "896 22.0 1 1 3101298 12.2875 NaN S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# загрузка тестовых данных, для которых нужно сделать предсказание: выживет или не выживет пассажир,\n", "# снова используем index_col - первую (нулевую) колонку, это PassengerId как счетчик или номер строки\n", "# load test dataset - a dataset for which we have to make predictions whether a passenger survives\n", "df_test = pd.read_csv(TEST_CSV_PATH, index_col=0)\n", "df_test.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 891 entries, 1 to 891\n", "Data columns (total 11 columns):\n", "Survived 891 non-null int64\n", "Pclass 891 non-null int64\n", "Name 891 non-null object\n", "Sex 891 non-null object\n", "Age 714 non-null float64\n", "SibSp 891 non-null int64\n", "Parch 891 non-null int64\n", "Ticket 891 non-null object\n", "Fare 891 non-null float64\n", "Cabin 204 non-null object\n", "Embarked 889 non-null object\n", "dtypes: float64(2), int64(4), object(5)\n", "memory usage: 83.5+ KB\n" ] } ], "source": [ "# проверка общей информации о данных: количество строк, столбцов, пустых значений и типов данных\n", "# check overall dataset info, how many rows, columns, null values it has and what are column types\n", "df_train.info()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 418 entries, 892 to 1309\n", "Data columns (total 10 columns):\n", "Pclass 418 non-null int64\n", "Name 418 non-null object\n", "Sex 418 non-null object\n", "Age 332 non-null float64\n", "SibSp 418 non-null int64\n", "Parch 418 non-null int64\n", "Ticket 418 non-null object\n", "Fare 417 non-null float64\n", "Cabin 91 non-null object\n", "Embarked 418 non-null object\n", "dtypes: float64(2), int64(3), object(5)\n", "memory usage: 35.9+ KB\n" ] } ], "source": [ "df_test.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"    <tr><th>PassengerId</th><th colspan=\"11\"></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1305</th><td>NaN</td><td>3</td><td>Spector, Mr. Woolf</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>A.5. 3236</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1306</th><td>NaN</td><td>1</td><td>Oliva y Ocana, Dona. Fermina</td><td>female</td><td>39.0</td><td>0</td><td>0</td><td>PC 17758</td><td>108.9000</td><td>C105</td><td>C</td></tr>\n",
"    <tr><th>1307</th><td>NaN</td><td>3</td><td>Saether, Mr. Simon Sivertsen</td><td>male</td><td>38.5</td><td>0</td><td>0</td><td>SOTON/O.Q. 3101262</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1308</th><td>NaN</td><td>3</td><td>Ware, Mr. Frederick</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>359309</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1309</th><td>NaN</td><td>3</td><td>Peter, Master. Michael J</td><td>male</td><td>NaN</td><td>1</td><td>1</td><td>2668</td><td>22.3583</td><td>NaN</td><td>C</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "1305 NaN 3 Spector, Mr. Woolf male NaN \n", "1306 NaN 1 Oliva y Ocana, Dona. Fermina female 39.0 \n", "1307 NaN 3 Saether, Mr. Simon Sivertsen male 38.5 \n", "1308 NaN 3 Ware, Mr. Frederick male NaN \n", "1309 NaN 3 Peter, Master. Michael J male NaN \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1305 0 0 A.5. 3236 8.0500 NaN S \n", "1306 0 0 PC 17758 108.9000 C105 C \n", "1307 0 0 SOTON/O.Q. 3101262 7.2500 NaN S \n", "1308 0 0 359309 8.0500 NaN S \n", "1309 1 1 2668 22.3583 NaN C " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# unify all data to one data frame for more comprehensive analysis\n", "# объединим тестовый и тренировочный датасеты для общего анализа\n", "# средние, медианы, распределения для полного датасета будут другими\n", "# Using pandas.concat method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html\n", "# default is unify by rows\n", "df = pd.concat([df_train, df_test], sort=False)\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Данные в соревновании разбиты на **тренировочную (обучающую) и контрольную (тестовую) выборки** - train.csv & test.csv. Эти выборки в англоязычном data science часто называются training set и test set (validation set).\n", "\n", "Для тренировки или обучения предсказательной модели используется **обучающая выборка**. На её основе строится функция определения переменной-отклика (предсказания, результата) по признакам.\n", "\n", "**Тестовая или контрольная выборка** служит для оценки качества предсказаний модели на новых данных, т.е. данных, на которых модель не обучалась и которые она ещё не встречала.\n", "\n", "В данном случае организаторы соревнования создали для участников разбиение. Кроме того, в этом соревновании имеются контрольные данные двух видов: публичные (public test set) и приватные (private test set). 
The private data is closed and cannot be obtained. A submission is scored on the public data immediately after upload; however, it is possible to overfit to it, that is, to tune a model until it predicts the public data very accurately. From time to time submissions are also evaluated on the private data, which is closed to everyone.\n", "\n", "Therefore a top accuracy score of 1 on the public data does not indicate a high-quality predictive model; more likely it indicates a tight fit to the public test data, that is, **overfitting** to it.\n", "\n", "When analysts create the training and test sets themselves, a 70/30 split or something close to it is commonly used.\n", "\n", "For more on evaluating model quality, see the article [Open Machine Learning Course. Topic 3. Classification, Decision Trees and k Nearest Neighbors (section on choosing model parameters and cross-validation)](https://habr.com/ru/company/ods/blog/322534/#vybor-parametrov-modeli-i-kross-validaciya) (in Russian)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. General analysis of the task, data and features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a **supervised learning** problem, because we have objects (passengers and their features) and training data with known answers.\n", "\n", "The task is **binary classification**: predicting whether a passenger survived (1) or not (0).\n", "\n", "#### Categorical features\n", "* **Pclass** - Passenger class. Values: 1, 2, 3. The passenger's travel class (1 is the highest, 3 is the lowest).\n", "* **Sex** - Values: male, female. The passenger's sex.\n", "* **Embarked** - Values: S, C, Q. Port of embarkation.\n", "\n", "#### Text features\n", "* **Name** - the passenger's name.\n", "* **Ticket** - the ticket number (ticket label).\n", "* **Cabin** - cabin number, with many missing values. It is mostly known for first class (Pclass = 1). 
See the discussion [Discussion: is cabin an important predictor?](https://www.kaggle.com/c/titanic/discussion/4693)\n", "\n", "#### Numeric features\n", "* **Age** - float. Age in years; fractional if less than 1 year.\n", "* **SibSp** - integer. Number of siblings or spouses aboard.\n", "* **Parch** - integer. Number of parents or children aboard.\n", "* **Fare** - float. Ticket price.\n", "\n", "\n", "General observations: there are many missing values in the Age and Cabin columns, in both the training and the test data.\n", "\n", "There are also two missing values in Embarked (in the training data) and one missing Fare value (in the test data).\n", "\n", "For a good model it will be useful to fill these gaps in both tables, or to drop features with many missing values, such as Cabin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some useful info from [Titanic Kaggle Tutorial](https://www.kaggle.com/sashr07/kaggle-titanic-tutorial).\n", "\n", "\"In this case, understanding the Titanic disaster and specifically what variables might affect the outcome of survival is important. Anyone who has watched the movie Titanic would remember that **women and children were given preference to lifeboats** (as they were in real life). You would also remember the vast class disparity of the passengers.\n", "\n", "This indicates that **Age, Sex, and PClass may be good predictors** of survival.\"\n", "\n", "Takeaway: it pays to study **the subject domain, the history of the data and information external to the dataset**, such as the fact that women and children from the upper classes were put into the lifeboats first." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Plot types and how to build them in matplotlib, using the Titanic data\n", "\n", "Recommendations for creating plots, based on the article [Introduction to data visualization with Python & matplotlib](https://towardsdatascience.com/data-science-with-python-intro-to-data-visualization-and-matplotlib-5f799b7c6d82).\n", "\n", "* Choose an appropriate plot type.\n", "* Label the axes. This helps both in your own analysis and when showing the plot to others.\n", "* Add a title.\n", "* Label the data categories where possible.\n", "* Mark the most interesting points.\n", "* Use color and size to make the plot more informative.\n", "\n", "Commonly used plot types: line plot, scatter plot, box plot (box-and-whisker plot), bar chart, pie chart. \n", "\n", "### 2.1. Bar chart\n", "\n", "\n", "\n", "A chart with rectangular bars whose lengths are proportional to the values they represent. The bars can be drawn vertically or horizontally.\n", "\n", "A bar chart compares several discrete categories. One axis lists the categories being compared, the other shows a measured value. 
Sometimes bar charts show several values for each category.\n", "\n", "They are useful for comparing category values with each other rather than against the whole.\n", "\n", "[From Wikipedia (in Russian)](https://ru.wikipedia.org/wiki/%D0%A1%D1%82%D0%BE%D0%BB%D0%B1%D1%87%D0%B0%D1%82%D0%B0%D1%8F_%D0%B4%D0%B8%D0%B0%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 891\n", "1 319\n", "2 42\n", "4 22\n", "3 20\n", "8 9\n", "5 6\n", "Name: SibSp, dtype: int64\n", "0 891\n", "1 319\n", "2 42\n", "3 20\n", "4 22\n", "5 6\n", "8 9\n", "Name: SibSp, dtype: int64\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Show count (histogram) of different SiblingSpouses (SibSp) values - from 0 to 5\n", "# Покажем гистограмму количества пассажиров с 0-5 братьями-сёстрами или супругами на корабле\n", "sib_sps = df['SibSp'].value_counts()\n", "print(sib_sps)\n", "\n", "# need to sort the index (first column values) for more obvious data representation\n", "# значения индекса не отсортированы, полезно это исправить\n", "sib_sps.sort_index(ascending=True, inplace=True)\n", "print(sib_sps)\n", "\n", "# Add title and axis names\n", "plt.title('Histogram of Sibling-Spouses(SibSp) value counts')\n", "plt.xlabel('Number of siblings or spouses on Titanic')\n", "plt.ylabel('How many passengers has this number of siblings or spouses')\n", "\n", "sib_sps.plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Pie chart - круговая диаграмма\n", "\n", "\n", "\n", "Представляют данные в виде долей целого и обычно используются для сравнения групп. Рекомендуется отображать не более 7-10 категорий, чтобы не перегрузить диаграмму.\n", "\n", "Построим распределение пассажиров по классам в виде круговой диаграммы." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "3 709\n", "1 323\n", "2 277\n", "Name: Pclass, dtype: int64\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Show passenger count by classes as a pie chart\n", "# Покажем количество пассажиров в трёх классах на круговой диаграмме\n", "passenger_classes = df['Pclass'].value_counts()\n", "print(type(passenger_classes))\n", "print(passenger_classes)\n", "\n", "plt.title('Passengers distribution by classes 1, 2, 3')\n", "\n", "# this uses matplotlib underneath\n", "# эта визуализация основана на matplotlib, поэтому установка названия диаграммы выше сработала :)\n", "# параметр autopct показывает, как представлять данные на секторах в процентах\n", "passenger_classes.plot(kind='pie', autopct='%1.1f%%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Упражнение 1. Нарисовать круговую диаграмму пассажиров по порту посадки" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Exercise: draw pie chart for Embarked column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Scatter plot - точечный график\n", "\n", "\n", "\n", "**Диаграмма рассеяния (также точечная диаграмма, англ. scatter plot)** — математическая диаграмма, изображающая значения двух переменных в виде точек на декартовой плоскости.\n", "\n", "Хорошо иллюстрируют зависимость величин, помогают определить их корреляцию. Независимую переменную обычно отображают на горизонтальной оси, а зависимую - на вертикальной.\n", "\n", "[Из Википедии](https://ru.wikipedia.org/wiki/%D0%94%D0%B8%D0%B0%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B0_%D1%80%D0%B0%D1%81%D1%81%D0%B5%D1%8F%D0%BD%D0%B8%D1%8F)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# increase plot size in the Jupyter window\n", "# увеличим размер графика\n", "plt.rcParams['figure.figsize'] = [14, 9]\n", "\n", "# scatter plot of women by age ticket fare on red - died, green survived\n", "# диаграмма рассеяния для женщин: по горизонтали - возраст, по вертикали цена билета, цвет точки - выжила или нет\n", "df_female = df[df.Sex == 'female']\n", "\n", "df_female_live = df_female[df_female.Survived == 1]\n", "df_female_dead = df_female[df_female.Survived == 0]\n", "\n", "plt.scatter(df_female_live.Age, df_female_live.Fare, s=50, c='g')\n", "plt.scatter(df_female_dead.Age, df_female_dead.Fare, s=50, c='r')\n", "\n", "# добавим название диаграммы и подписи осей\n", "# Add title and axis names\n", "plt.title('Women that survived (green) and died (red) represented by age and fare')\n", "plt.xlabel('Age')\n", "plt.ylabel('Fare')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Упражнение 2. Нарисовать график выживших и погибших мужчин с осями: возраст, цена билета" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Exercise: draw scatter plot of survived and died men by Age and Fare" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4. Ящик с усами (box plot)\n", "\n", "**Ящик с усами (box plot, box-and-whisker plot)** - способ показать распределение значений по пяти ключевым точкам распределения: минимум или нижнияя граница, первая квартиль, медиана, третья квартиль и максимум или верхняя граница.\n", "\n", "[Ящик с усами в Википедии](https://ru.wikipedia.org/wiki/%D0%AF%D1%89%D0%B8%D0%BA_%D1%81_%D1%83%D1%81%D0%B0%D0%BC%D0%B8).\n", "\n", "**Медиана** - это такое число выборки, что ровно половина из элементов выборки больше него, а другая половина меньше него (если все элементы выборки различны). 
[More about the median on Wikipedia (in Russian)](https://ru.wikipedia.org/wiki/%D0%9C%D0%B5%D0%B4%D0%B8%D0%B0%D0%BD%D0%B0_(%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D0%BA%D0%B0)).\n", "\n", "**Quartiles** give important information about the shape of a feature's distribution. Together with the median they divide the ordered sample into 4 equal parts. There are two of them besides the median: the lower quartile (Q1) and the upper quartile (Q3). 25% of the values lie below the lower quartile and 75% lie below the upper quartile.\n", "\n", "In the simplest case the **whiskers** extend to the minimum and maximum observations (then no outliers are shown on the plot). However, the whisker ends can be defined in several ways, for example as the first quartile minus 1.5 interquartile ranges (IQR) and the third quartile plus 1.5 IQR. Everything that falls outside the whiskers is then treated as an \"outlier\" of the sample.\n", "\n", "By default pandas + matplotlib use the limits Q3 + 1.5\\*IQR and Q1 - 1.5\\*IQR, that is, one and a half interquartile ranges above the upper quartile and below the lower quartile; each whisker is drawn to the most extreme data point within these limits ([see the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html)).\n", "\n", "[An example of using a box plot to remove outliers](https://www.youtube.com/watch?v=qpihk7KepDI&t=124s) from the Fare values and to fill in missing Fare values from the remaining outlier-free distribution (in R).\n", "\n", "[Understanding Boxplots](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51) is an article about box plots." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'Fare value')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "boxplot = df.boxplot(column=['Fare'])\n", "\n", "plt.ylabel('Fare value')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В нашем случае значения Fare дают много выбросов над верхней границей графика." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Детальный анализ связи признаков и выживания\n", "### 3.1. Связь пола и выживания" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Sex</th><th>Survived</th></tr>\n",
"    <tr><th>PassengerId</th><th colspan=\"2\"></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>male</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>female</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>female</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>female</td><td>1</td></tr>\n",
"    <tr><th>5</th><td>male</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Sex Survived\n", "PassengerId \n", "1 male 0\n", "2 female 1\n", "3 female 1\n", "4 female 1\n", "5 male 0" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# выберем только колонки Sex (пол) и Survived (выживаемость), сохраним это в новый датафрейм\n", "# select only dataframe part with Sex and Survived colums, save to new dataframe\n", "df_sex = df_train.loc[:, ['Sex', 'Survived']]\n", "df_sex.head()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 577 entries, 1 to 891\n", "Data columns (total 2 columns):\n", "Sex 577 non-null object\n", "Survived 577 non-null int64\n", "dtypes: int64(1), object(1)\n", "memory usage: 13.5+ KB\n" ] } ], "source": [ "# select only male part\n", "# выберем только мужскую часть пассажиров\n", "df_male = df_sex[df_sex.Sex == 'male']\n", "df_male.info()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 314 entries, 2 to 889\n", "Data columns (total 2 columns):\n", "Sex 314 non-null object\n", "Survived 314 non-null int64\n", "dtypes: int64(1), object(1)\n", "memory usage: 7.4+ KB\n" ] } ], "source": [ "# select only female part\n", "# выберем только женскую часть пассажиров\n", "df_female = df_sex[df_sex.Sex == 'female']\n", "df_female.info()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.18890814558058924\n", "0.7420382165605095\n", "\n", "Percent of males survived is 19%\n", "Percent of females survived is 74%\n" ] } ], "source": [ "# Процент выживших мужчин - это среднее значение колонки Survived для них в датасете df_male\n", "# percent of males survived - is a mean of Survived values in df_male\n", "# Процент выживших женщин - это среднее значение колонки Survived 
для них в датасете df_female\n", "# percent of females survived - is a mean of Survived values in df_female\n", "\n", "male_survived_percent = df_male['Survived'].mean()\n", "female_survived_percent = df_female['Survived'].mean()\n", "\n", "print(male_survived_percent)\n", "print(female_survived_percent)\n", "\n", "print()\n", "\n", "print(\"Percent of males survived is {0:.0f}%\".format(male_survived_percent*100))\n", "print(\"Percent of females survived is {0:.0f}%\".format(female_survived_percent*100))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Survived\n", "Sex \n", "female 0.742038\n", "male 0.188908\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# короткий путь построения данных вероятности выживания по полу - pivot_table (сводная таблица) с агрегацией средним\n", "# survived by sex - shorter way is creating a pivot table with pandas with its default mean aggregation\n", "sex_pivot = df_train.pivot_table(index=\"Sex\", values=\"Survived\", aggfunc='mean')\n", "print(sex_pivot)\n", "\n", "sex_pivot.plot.bar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2. Упражнение 3. Связь класса пассажира и выживания" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Exercise: calculate percent of survived in each Pclass and show on bar chart or other plot" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "survived by class and sex\n", "Pclass Sex Survived\n", "1 female 1 0.968085\n", " 0 0.031915\n", " male 0 0.631148\n", " 1 0.368852\n", "2 female 1 0.921053\n", " 0 0.078947\n", " male 0 0.842593\n", " 1 0.157407\n", "3 female 0 0.500000\n", " 1 0.500000\n", " male 0 0.864553\n", " 1 0.135447\n", "Name: Survived, dtype: float64\n" ] } ], "source": [ "# Check how people survived by their class and sex\n", "print(\"survived by class and sex\")\n", "print(df_train.groupby([\"Pclass\", \"Sex\"])[\"Survived\"].value_counts(normalize=True))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B96 B98 4\n", "C23 C25 C27 4\n", "G6 4\n", "E101 3\n", "F2 3\n", "Name: Cabin, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check how Cabin feature looks\n", "# there are some strange cabin values like 2 or 3 cabins for one passenger, no idea ho to process them\n", "# просмотр, как выглядит признак Cabin (каюта), для некоторых пассажиров указано несколько кают\n", 
"df_train['Cabin'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CA. 2343 7\n", "347082 7\n", "1601 7\n", "3101295 6\n", "CA 2144 6\n", "347088 6\n", "382652 5\n", "S.O.C. 14879 5\n", "PC 17757 4\n", "113781 4\n", "Name: Ticket, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check how Ticket feature looks\n", "# also quite strange data without any clear form to process\n", "# просмотр, как выглядит признак Ticket (билет)\n", "df_train['Ticket'].value_counts().head(10)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th><th>AgeGrp</th></tr>\n",
"    <tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>0.0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td><td>2.0</td></tr>\n",
"    <tr><th>2</th><td>1.0</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td><td>3.0</td></tr>\n",
"    <tr><th>3</th><td>1.0</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td><td>2.0</td></tr>\n",
"    <tr><th>4</th><td>1.0</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td><td>3.0</td></tr>\n",
"    <tr><th>5</th><td>0.0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td><td>3.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0.0 3 \n", "2 1.0 1 \n", "3 1.0 3 \n", "4 1.0 1 \n", "5 0.0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked AgeGrp \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S 2.0 \n", "2 1 0 PC 17599 71.2833 C85 C 3.0 \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S 2.0 \n", "4 1 0 113803 53.1000 C123 S 3.0 \n", "5 0 0 373450 8.0500 NaN S 3.0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check survival by age groups by creating such groups\n", "# Проверим выживаемость по возрастным группам, например по десяткам лет. Создадим такой новый признак.\n", "\n", "# https://stackoverflow.com/questions/5584586/find-the-division-remainder-of-a-number\n", "# In division a % b give modulo\n", "# В делении a % b даёт остаток от деления\n", "# a / b gives divisor as float\n", "# a // b gives divisor as integer, which will serve as a group\n", "def make_age_group(row):\n", " return row['Age'] // 10\n", "\n", "df['AgeGrp'] = df.apply(make_age_group, axis=1)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "survived by AgeGroup\n", "AgeGrp Survived\n", "0.0 1.0 0.612903\n", " 0.0 0.387097\n", "1.0 0.0 0.598039\n", " 1.0 0.401961\n", "2.0 0.0 0.650000\n", " 1.0 0.350000\n", "3.0 0.0 0.562874\n", " 1.0 0.437126\n", "4.0 0.0 0.617978\n", " 1.0 0.382022\n", "5.0 0.0 0.583333\n", " 1.0 0.416667\n", "6.0 0.0 0.684211\n", " 1.0 0.315789\n", "7.0 0.0 1.000000\n", "8.0 1.0 1.000000\n", "Name: Survived, dtype: float64\n" ] } ], "source": [ "# Check how each age group survives\n", 
"# Проверим, какая выживаемость в каждой возрастной группе\n", "print(\"survived by AgeGroup\")\n", "print(df.groupby([\"AgeGrp\"])[\"Survived\"].value_counts(normalize=True))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.0 344\n", "3.0 232\n", "1.0 143\n", "4.0 135\n", "0.0 82\n", "5.0 70\n", "6.0 32\n", "7.0 7\n", "8.0 1\n", "Name: AgeGrp, dtype: int64\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(df['AgeGrp'].value_counts())\n", "\n", "indexes = df['AgeGrp'].value_counts().index\n", "values = df['AgeGrp'].value_counts().values\n", "\n", "plt.bar(indexes, values)\n", " \n", "# Add title and axis names\n", "plt.title('Passengers by age group')\n", "plt.xlabel('Age groups (example 1 is age from 10 to 20)')\n", "plt.ylabel('Number of passengers')\n", " \n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Упражнение 4 (на дом). Определить процент выживших в каждой группе возраста " ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Exercise: calculate percent of survived in each Age Group" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Создание простейшей модели, вывод предсказания и посылка на Kaggle\n", "\n", "Построим простую модель, напоминающую дерево решений, которая предсказывает, что женщины и дети из класса 1 и 2 выживут, мужчины из 3 точно погибнут, а дети любого пола до 10 лет выживут, только если у них 1 или более родителей на корабле.\n", "\n", "### 4.1. 
Survival prediction function" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# this model has a slightly higher score\n", "def get_survival(df_row):\n", "    is_female = df_row['Sex'] == 'female'\n", "    pclass = df_row['Pclass']\n", "    parent_child_count = df_row['Parch']\n", "    siblings = df_row['SibSp']  # not used by the rules below\n", "    age_less_10 = df_row['AgeGrp'] == 0\n", "\n", "    # Pclass is an integer column, so compare it with 3 rather than the string '3'\n", "    if is_female:\n", "        if pclass != 3:\n", "            return 1\n", "        elif age_less_10 and parent_child_count > 0:\n", "            return 1\n", "        else:\n", "            return 0\n", "    else:\n", "        if pclass == 3:\n", "            return 0\n", "        elif age_less_10 and parent_child_count > 0:\n", "            return 1\n", "        else:\n", "            return 0\n", "\n", "# this model has a lower score\n", "# for comparison, the simplest possible model: only women survive\n", "def get_simple_female_survive(df_row):\n", "    return int(df_row['Sex'] == 'female')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2. Creating the prediction file" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 800\n", "1 509\n", "Name: survive_predict, dtype: int64\n", "\n", "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 418 entries, 892 to 1309\n", "Data columns (total 1 columns):\n", "survive_predict 418 non-null int64\n", "dtypes: int64(1)\n", "memory usage: 6.5 KB\n", "None\n", " Survived\n", "PassengerId \n", "892 0\n", "893 1\n", "894 0\n", "895 0\n", "896 1\n" ] } ], "source": [ "# Create a new column with the survival prediction by applying the function to every row of the dataset\n", "df['survive_predict'] = df.apply(get_survival, axis=1)\n", "print(df['survive_predict'].value_counts())\n", "\n", "# Keep only the test part of the dataset for output: the rows from position 891 onward\n", "df_out = df.iloc[891:][['survive_predict']]\n", "print()\n", "print(df_out.info())\n", "\n", "# Rename our new 
prediction to the Survived column; the submission format requires this name\n", "df_out = df_out.rename(columns={\"survive_predict\": \"Survived\"})\n", "print(df_out.head())\n", "\n", "# Write the output to a file\n", "df_out.to_csv(\"evaluation_submission.csv\") # the author's submission scored 0.77033, place 6079 on the public leaderboard\n", "\n", "# Upload the resulting evaluation_submission.csv on the Kaggle competition page, in the \"Submit Predictions\" section" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Homework and further reading\n", "\n", "### 5.1. Homework\n", "\n", "Improve this model in any way you like and obtain a score higher than that of the current simple model (above 0.77033).\n", "\n", "### 5.2. Further reading\n", "\n", "1. [«Айсберг вместо Оскара!» или как я пробовал освоить азы DataScience на kaggle](https://habr.com/ru/post/331992/) (in Russian). Useful for beginners and for understanding how to use Kaggle for learning.\n", "2. [Kaggle и Titanic — еще одно решение задачи с помощью Python](https://habr.com/ru/post/274171/) (in Russian). One more solution to the problem, written in Python.\n", "3. [Титаник на Kaggle: вы не дочитаете этот пост до конца](https://habr.com/ru/company/mlclass/blog/270973/) (in Russian). In-depth data analysis, with code in R.\n", "4. [Many tutorials on Kaggle: Tutorials on Titanic](https://www.kaggle.com/c/titanic#tutorials)\n", "5. 
[Data Science with Python: Intro to Data Visualization with Matplotlib](https://towardsdatascience.com/data-science-with-python-intro-to-data-visualization-and-matplotlib-5f799b7c6d82)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }