{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Analysis\n", "\n", "## Seminar 2. Getting to know the data analysis libraries\n", "\n", "### Useful information\n", "\n", " - [Course wiki page](http://wiki.cs.hse.ru/Майнор_Интеллектуальный_анализ_данных/Введение_в_анализ_данных)\n", " - [Grades spreadsheet](https://docs.google.com/spreadsheets/d/1jZL_-ELf0Ogj2XHa6VVbkg8vrInycv2-Z9UR5keLDfM/edit?usp=sharing)\n", " - Course email: *hse.minor.dm@gmail.com* (subject format: \"[ИАД-NN] - Question - Last name First name Patronymic\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recap of the previous session\n", "\n", "**Example tasks** \n", " - Sentiment analysis of text\n", " - Finding the most suitable location for a restaurant\n", " \n", "**Notation**\n", " - $x$: an object\n", " - $\\mathbb{X}$: the set of objects\n", " - $y$: the answer for object $x$\n", " - $\\mathbb{Y}$: the space of answers\n", " \n", "**Training set**\n", "Depending on the task, $X = (x_i, y_i)_{i=1}^l$ when the answers $y_i$ are known\n", "\n", "**Features** The description of an object, $x = (x^1, ..., x^d)$ (a vector)\n", "\n", "**Algorithm**\n", "A function that predicts the answer for any object\n", "\n", "**Loss function**\n", "A way to measure how correct the algorithm's answers are\n", "\n", "**Quality functional**\n", "A way to measure the quality of the algorithm on a sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The NumPy library\n", "\n", "A Python library for working [conveniently] with multidimensional arrays and matrices; it also provides mathematical functions.\n", "\n", " - [numpy](http://www.numpy.org)\n", " - [100 numpy exercises](http://www.labri.fr/perso/nrougier/teaching/numpy.100/)\n", " - [A Crash Course in Python for Scientists (numpy)](http://nbviewer.jupyter.org/gist/rpmuller/5920182#II.-Numpy-and-Scipy)\n", " - [stackoverflow!](http://stackoverflow.com/questions/tagged/numpy)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": 
true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An array is NumPy's basic data type (numpy.ndarray, created with [numpy.array](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.array.html)). To create one, pass a sequence as the first argument" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 2 3 4 5 6]\n", "[[1 2 3]\n", " [4 5 6]]\n" ] } ], "source": [ "print np.array([1,2,3,4,5,6]) \n", "print np.array([[1,2,3], [4,5,6]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the following code will **not** work:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "ename": "ValueError", "evalue": "only 2 non-keyword arguments accepted", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: only 2 non-keyword arguments accepted" ] } ], "source": [ "print np.array(1,2,3,4,5,6) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If needed, the type of the stored values can be specified explicitly (dtype):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "[ 1. 2. 3. 4. 5. 6.]\n" ] } ], "source": [ "print np.array([1,2,3,4,5,6], dtype=float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "It is sometimes useful to quickly look up a function's documentation to see which parameters it accepts. IPython Notebook has a convenient built-in interface for this, namely the following command:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "?np.array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "shows the [docstring](https://en.wikipedia.org/wiki/Docstring) of that function\n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy functions usually return a NumPy array (np.ndarray)" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ,\n", " 5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.arange(0, 10, 0.5) \n", "a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "np.arange is the counterpart of Python's range that also accepts a non-integer step" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More complex situations can arise, for example when you need to divide an interval into $n$ parts. This can be done with arange by computing the right step, but that is not always convenient. For this, numpy provides the linspace function, which returns the requested number of evenly spaced points, with both endpoints included by default: " ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 0.83333333, 1.66666667, 2.5 ,\n", " 3.33333333, 4.16666667, 5. , 5.83333333,\n", " 6.66666667, 7.5 , 8.33333333, 9.16666667, 10. 
])" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.linspace(0, 10, num=13)" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "collapsed": true }, "outputs": [], "source": [ "?np.linspace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the size of our array, use the shape attribute:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(20,)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note* the difference between the lines below. To define an n x m matrix, every nested sequence must contain exactly m elements. In the NumPy version used here, a ragged list produces a one-dimensional array of Python lists; recent NumPy releases raise an error instead. " ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3,)" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.array([[1,2,3], [4,5,6], [1, 1]]).shape" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(3, 3)" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.array([[1,2,3], [4,5,6], [1, 1, 1]]).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find out the number of dimensions of your array, use the ndim attribute:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.ndim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need to change the shape of an array, numpy provides the reshape function:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": 
false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5],\n", " [ 5. , 5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5]])" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = np.reshape(a, (2, 10))\n", "b" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0. , 0.5],\n", " [ 1. , 1.5],\n", " [ 2. , 2.5],\n", " [ 3. , 3.5],\n", " [ 4. , 4.5],\n", " [ 5. , 5.5],\n", " [ 6. , 6.5],\n", " [ 7. , 7.5],\n", " [ 8. , 8.5],\n", " [ 9. , 9.5]])" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = np.reshape(a, (10, 2))\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can always switch to a \"one-dimensional\" representation, that is, build a one-dimensional array from all elements of the matrix with a single function:" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ,\n", " 5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. 
, 9.5])" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.flatten()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2, 10)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.shape" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.ndim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find out the type of the elements in an array, use the dtype attribute" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'float64'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.dtype.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays of a special form can also be created with the functions zeros(), ones(), empty(), identity() (the default data type is float64):" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0.],\n", " [ 0., 0., 0.],\n", " [ 0., 0., 0.]])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.zeros((3, 3))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 1., 1., 1.],\n", " [ 1., 1., 1.],\n", " [ 1., 1., 1.]])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.ones((3, 3))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 0.],\n", " [ 0., 0., 0.],\n", " [ 0., 0., 0.]])" ] }, 
"execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.empty((3, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that empty() creates an \"empty\" array, meaning its values are uninitialized and may be arbitrary garbage." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 1., 0., 0.],\n", " [ 0., 1., 0.],\n", " [ 0., 0., 1.]])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.identity(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Basic matrix operations**\n", "\n", "All arithmetic operations on matrices are performed elementwise" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "A = np.arange(0, 9).reshape(3, 3)\n", "B = np.arange(1, 10).reshape(3, 3)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 2],\n", " [3, 4, 5],\n", " [6, 7, 8]])" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 1, 3, 5],\n", " [ 7, 9, 11],\n", " [13, 15, 17]])" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A + B" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 2, 6],\n", " [12, 20, 30],\n", " [42, 56, 72]])" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A * B" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[-1, -1, -1],\n", " [-1, -1, -1],\n", " 
[-1, -1, -1]])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A - B" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]])" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A / B" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A + 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the usual matrix multiplication, use the dot function:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 18, 21, 24],\n", " [ 54, 66, 78],\n", " [ 90, 111, 132]])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.dot(A, B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another big advantage of numpy is that many standard operations are defined on matrices (finding the minimum, the maximum, and so on). 
Keep in mind that all these functions live in the numpy package:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0, 8, 36)" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.min(A), np.max(A), np.sum(A)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform the same operations along a particular dimension (for example, to find the minimum element of each row), pass the *axis* parameter:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(array([0, 3, 6]), array([2, 5, 8]), array([ 3, 12, 21]))" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.min(A, axis=1), np.max(A, axis=1), np.sum(A, axis=1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(array([0, 1, 2]), array([6, 7, 8]), array([ 9, 12, 15]))" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.min(A, axis=0), np.max(A, axis=0), np.sum(A, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides these familiar functions, the numpy package has many mathematical functions that can be applied to matrices:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0. , 1. , 1.41421356],\n", " [ 1.73205081, 2. 
, 2.23606798],\n", " [ 2.44948974, 2.64575131, 2.82842712]])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sqrt(A)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Indexing**\n", "\n", "One-dimensional arrays support indexing, slicing, and iteration in much the same way as ordinary lists and other Python sequences." ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,\n", " 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.arange(1, 28)\n", "a" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[0]" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "27" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[-1]" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,\n", " 19, 20, 21, 22, 23, 24, 25, 26])" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[1:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But numpy introduces a new kind of indexing: indexing with arrays (which works along any dimension):" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 3, 4])" ] }, "execution_count": 103, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "a[[0, 2, 3]]" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],\n", " [10, 11, 12, 13, 14, 15, 16, 17, 18],\n", " [19, 20, 21, 22, 23, 24, 25, 26, 27]])" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = np.reshape(a, (3, 9))\n", "b" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[0]" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 2, 11, 20])" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[:, 1]" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[[ 1, 2, 3],\n", " [ 4, 5, 6],\n", " [ 7, 8, 9]],\n", "\n", " [[10, 11, 12],\n", " [13, 14, 15],\n", " [16, 17, 18]],\n", "\n", " [[19, 20, 21],\n", " [22, 23, 24],\n", " [25, 26, 27]]])" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = np.reshape(a, (3, 3, 3))\n", "c" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c[0]" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 2, 5, 8],\n", " [11, 14, 17],\n", " [20, 23, 26]])" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c[:,:,1]" ] }, 
{ "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[[ 2, 3],\n", " [ 5, 6],\n", " [ 8, 9]],\n", "\n", " [[11, 12],\n", " [14, 15],\n", " [17, 18]],\n", "\n", " [[20, 21],\n", " [23, 24],\n", " [26, 27]]])" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c[:, :, [1, 2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even more interesting, boolean indexing is possible: you pass an array of True/False values that marks which elements to take:" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 3, 4, 5, 8])" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.arange(10)\n", "good = np.array([True, False, False, True, True, True, False, False, True, False])\n", "a[good]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first this may seem cumbersome (with ordinary indexing we only had to pass a few indices, while here we pass a whole array), but let us step back to numpy for a moment. Recall that ordinary operations can be applied to arrays. 
Logical operations are no exception: " ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ True, False, True, False, True, False, True, False, True, False], dtype=bool)" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a % 2 == 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, to select only the even numbers in an array, run the following line:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 2, 4, 6, 8])" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[a % 2 == 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The SciPy library\n", "[scipy](http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-3-Scipy.ipynb)\n", "(import scipy as sp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Pandas library\n", "\n", "\"...fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive\"\n", "\n", " - [pandas](http://pandas.pydata.org)\n", " - [Introduction](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n", " - [Indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us start working with toy data. First, let us see how the data can be loaded with plain Python." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature1,Weight,Height,Bla-bla,Size,Class\n", "10.0,12,344,0,23.0,Class1\n", "7.2,12,208,0,18.0,Class2\n", "19.0,11,344,1,21.0,Class4\n", "7.2,13,208,0,20.0,Class2\n", "9.2,20,208,0,17.0,Class1\n", "19.0,11,254,2,11.0,Class3\n", "\n" ] } ], "source": [ "print(open('data.txt').read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line is a header, and each following line describes one object. The last column is the target label. All other columns form the feature description of the objects, or simply the features.\n", "\n", "\n", "Let us read the data and store it in a format convenient for us" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{'features': [10.0, 12.0, 344.0, 0.0, 23.0], 'label': 'Class1'},\n", " {'features': [7.2, 12.0, 208.0, 0.0, 18.0], 'label': 'Class2'},\n", " {'features': [19.0, 11.0, 344.0, 1.0, 21.0], 'label': 'Class4'},\n", " {'features': [7.2, 13.0, 208.0, 0.0, 20.0], 'label': 'Class2'},\n", " {'features': [9.2, 20.0, 208.0, 0.0, 17.0], 'label': 'Class1'},\n", " {'features': [19.0, 11.0, 254.0, 2.0, 11.0], 'label': 'Class3'}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from itertools import islice\n", "\n", "points = []\n", "for line in islice(open('data.txt'), 1, None):\n", "    columns = line.strip().split(',')\n", "    features = [float(feature) for feature in columns[:5]]\n", "    label = columns[5]\n", "    points.append({'features': features, 'label': label})\n", "\n", "points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us read the data with Pandas. 
Данные будут сохранены в специализированный объект класса pandas.DataFrame, который представляет из себя таблицу с проименованными строками и столбцами." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Feature1WeightHeightBla-blaSizeClass
010.012344023Class1
17.212208018Class2
219.011344121Class4
37.213208020Class2
49.220208017Class1
519.011254211Class3
\n", "
" ], "text/plain": [ " Feature1 Weight Height Bla-bla Size Class\n", "0 10.0 12 344 0 23 Class1\n", "1 7.2 12 208 0 18 Class2\n", "2 19.0 11 344 1 21 Class4\n", "3 7.2 13 208 0 20 Class2\n", "4 9.2 20 208 0 17 Class1\n", "5 19.0 11 254 2 11 Class3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data.txt')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Можно делать срезы по именам колонки" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 12\n", "1 12\n", "2 11\n", "3 13\n", "4 20\n", "5 11\n", "Name: Weight, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Weight']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "а также по строкам" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Feature1WeightHeightBla-blaSizeClass
21911344121Class4
\n", "
" ], "text/plain": [ " Feature1 Weight Height Bla-bla Size Class\n", "2 19 11 344 1 21 Class4" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[2:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Посчитает какие и сколько раз признак \"Bla-bla\" принимал значения" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 4\n", "2 1\n", "1 1\n", "Name: Bla-bla, dtype: int64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Bla-bla'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Попробуем посмотреть на общую статистику по признакам" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Feature1WeightHeightBla-blaSize
count6.0000006.0000006.0000006.000006.000000
mean11.93333313.166667261.0000000.5000018.333333
std5.5837863.43025866.7143160.836664.179314
min7.20000011.000000208.0000000.0000011.000000
25%7.70000011.250000208.0000000.0000017.250000
50%9.60000012.000000231.0000000.0000019.000000
75%16.75000012.750000321.5000000.7500020.750000
max19.00000020.000000344.0000002.0000023.000000
\n", "
" ], "text/plain": [ " Feature1 Weight Height Bla-bla Size\n", "count 6.000000 6.000000 6.000000 6.00000 6.000000\n", "mean 11.933333 13.166667 261.000000 0.50000 18.333333\n", "std 5.583786 3.430258 66.714316 0.83666 4.179314\n", "min 7.200000 11.000000 208.000000 0.00000 11.000000\n", "25% 7.700000 11.250000 208.000000 0.00000 17.250000\n", "50% 9.600000 12.000000 231.000000 0.00000 19.000000\n", "75% 16.750000 12.750000 321.500000 0.75000 20.750000\n", "max 19.000000 20.000000 344.000000 2.00000 23.000000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для удаления какого-либо столбца можно воспользоваться методом drop" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Feature1WeightBla-blaSizeClass
010.012023Class1
17.212018Class2
219.011121Class4
37.213020Class2
49.220017Class1
519.011211Class3
\n", "
" ], "text/plain": [ " Feature1 Weight Bla-bla Size Class\n", "0 10.0 12 0 23 Class1\n", "1 7.2 12 0 18 Class2\n", "2 19.0 11 1 21 Class4\n", "3 7.2 13 0 20 Class2\n", "4 9.2 20 0 17 Class1\n", "5 19.0 11 2 11 Class3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop('Height', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Удаление строк делается похожим образом" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Feature1WeightHeightBla-blaSizeClass
219.011344121Class4
37.213208020Class2
49.220208017Class1
519.011254211Class3
\n", "
" ], "text/plain": [ " Feature1 Weight Height Bla-bla Size Class\n", "2 19.0 11 344 1 21 Class4\n", "3 7.2 13 208 0 20 Class2\n", "4 9.2 20 208 0 17 Class1\n", "5 19.0 11 254 2 11 Class3" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop([0, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Обратите внимание, что исходный датафрейм остается неизменным" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Feature1</th><th>Weight</th><th>Height</th><th>Bla-bla</th><th>Size</th><th>Class</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>10.0</td><td>12</td><td>344</td><td>0</td><td>23</td><td>Class1</td></tr>\n",
"    <tr><th>1</th><td>7.2</td><td>12</td><td>208</td><td>0</td><td>18</td><td>Class2</td></tr>\n",
"    <tr><th>2</th><td>19.0</td><td>11</td><td>344</td><td>1</td><td>21</td><td>Class4</td></tr>\n",
"    <tr><th>3</th><td>7.2</td><td>13</td><td>208</td><td>0</td><td>20</td><td>Class2</td></tr>\n",
"    <tr><th>4</th><td>9.2</td><td>20</td><td>208</td><td>0</td><td>17</td><td>Class1</td></tr>\n",
"    <tr><th>5</th><td>19.0</td><td>11</td><td>254</td><td>2</td><td>11</td><td>Class3</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Feature1 Weight Height Bla-bla Size Class\n", "0 10.0 12 344 0 23 Class1\n", "1 7.2 12 208 0 18 Class2\n", "2 19.0 11 344 1 21 Class4\n", "3 7.2 13 208 0 20 Class2\n", "4 9.2 20 208 0 17 Class1\n", "5 19.0 11 254 2 11 Class3" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Вспомним про numpy" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 10. , 12. , 344. , 0. , 23. ],\n", " [ 7.19999981, 12. , 208. , 0. , 18. ],\n", " [ 19. , 11. , 344. , 1. , 21. ],\n", " [ 7.19999981, 13. , 208. , 0. , 20. ],\n", " [ 9.19999981, 20. , 208. , 0. , 17. ],\n", " [ 19. , 11. , 254. , 2. , 11. ]], dtype=float32)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mat = df.drop('Class', axis=1).values.astype(np.float32)\n", "mat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Есть два способа конкатенации матриц: \n", " - вертикальная (в начале идут строки первой матрицы, затем — второй)\n", " - горизонтальная (в конец строк первой матрицы дописываются значения соответствующих строк второй матрицы)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 1.00000000e+01, 1.20000000e+01, 3.44000000e+02,\n", " 0.00000000e+00, 2.30000000e+01],\n", " [ 7.19999981e+00, 1.20000000e+01, 2.08000000e+02,\n", " 0.00000000e+00, 1.80000000e+01],\n", " [ 1.90000000e+01, 1.10000000e+01, 3.44000000e+02,\n", " 1.00000000e+00, 2.10000000e+01],\n", " [ 7.19999981e+00, 1.30000000e+01, 2.08000000e+02,\n", " 0.00000000e+00, 2.00000000e+01],\n", " [ 9.19999981e+00, 2.00000000e+01, 2.08000000e+02,\n", " 0.00000000e+00, 1.70000000e+01],\n", " [ 1.90000000e+01, 1.10000000e+01, 2.54000000e+02,\n", " 2.00000000e+00, 1.10000000e+01],\n", " [ 
1.00000000e+02, 1.44000000e+02, 1.18336000e+05,\n", " 0.00000000e+00, 5.29000000e+02],\n", " [ 5.18399963e+01, 1.44000000e+02, 4.32640000e+04,\n", " 0.00000000e+00, 3.24000000e+02],\n", " [ 3.61000000e+02, 1.21000000e+02, 1.18336000e+05,\n", " 1.00000000e+00, 4.41000000e+02],\n", " [ 5.18399963e+01, 1.69000000e+02, 4.32640000e+04,\n", " 0.00000000e+00, 4.00000000e+02],\n", " [ 8.46399994e+01, 4.00000000e+02, 4.32640000e+04,\n", " 0.00000000e+00, 2.89000000e+02],\n", " [ 3.61000000e+02, 1.21000000e+02, 6.45160000e+04,\n", " 4.00000000e+00, 1.21000000e+02]], dtype=float32)" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.vstack([mat, mat ** 2])" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 1.00000000e+01, 1.20000000e+01, 3.44000000e+02,\n", " 0.00000000e+00, 2.30000000e+01, 1.00000000e+02,\n", " 1.44000000e+02, 1.18336000e+05, 0.00000000e+00,\n", " 5.29000000e+02],\n", " [ 7.19999981e+00, 1.20000000e+01, 2.08000000e+02,\n", " 0.00000000e+00, 1.80000000e+01, 5.18399963e+01,\n", " 1.44000000e+02, 4.32640000e+04, 0.00000000e+00,\n", " 3.24000000e+02],\n", " [ 1.90000000e+01, 1.10000000e+01, 3.44000000e+02,\n", " 1.00000000e+00, 2.10000000e+01, 3.61000000e+02,\n", " 1.21000000e+02, 1.18336000e+05, 1.00000000e+00,\n", " 4.41000000e+02],\n", " [ 7.19999981e+00, 1.30000000e+01, 2.08000000e+02,\n", " 0.00000000e+00, 2.00000000e+01, 5.18399963e+01,\n", " 1.69000000e+02, 4.32640000e+04, 0.00000000e+00,\n", " 4.00000000e+02],\n", " [ 9.19999981e+00, 2.00000000e+01, 2.08000000e+02,\n", " 0.00000000e+00, 1.70000000e+01, 8.46399994e+01,\n", " 4.00000000e+02, 4.32640000e+04, 0.00000000e+00,\n", " 2.89000000e+02],\n", " [ 1.90000000e+01, 1.10000000e+01, 2.54000000e+02,\n", " 2.00000000e+00, 1.10000000e+01, 3.61000000e+02,\n", " 1.21000000e+02, 6.45160000e+04, 4.00000000e+00,\n", " 1.21000000e+02]], dtype=float32)" ] }, "execution_count": 67, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "np.hstack([mat, mat ** 2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Вернемся к датафреймам:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>10.0</td><td>12</td><td>344</td><td>0</td><td>23</td><td>100.000000</td><td>144</td><td>118336</td><td>0</td><td>529</td></tr>\n",
"    <tr><th>1</th><td>7.2</td><td>12</td><td>208</td><td>0</td><td>18</td><td>51.839996</td><td>144</td><td>43264</td><td>0</td><td>324</td></tr>\n",
"    <tr><th>2</th><td>19.0</td><td>11</td><td>344</td><td>1</td><td>21</td><td>361.000000</td><td>121</td><td>118336</td><td>1</td><td>441</td></tr>\n",
"    <tr><th>3</th><td>7.2</td><td>13</td><td>208</td><td>0</td><td>20</td><td>51.839996</td><td>169</td><td>43264</td><td>0</td><td>400</td></tr>\n",
"    <tr><th>4</th><td>9.2</td><td>20</td><td>208</td><td>0</td><td>17</td><td>84.639999</td><td>400</td><td>43264</td><td>0</td><td>289</td></tr>\n",
"    <tr><th>5</th><td>19.0</td><td>11</td><td>254</td><td>2</td><td>11</td><td>361.000000</td><td>121</td><td>64516</td><td>4</td><td>121</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9\n", "0 10.0 12 344 0 23 100.000000 144 118336 0 529\n", "1 7.2 12 208 0 18 51.839996 144 43264 0 324\n", "2 19.0 11 344 1 21 361.000000 121 118336 1 441\n", "3 7.2 13 208 0 20 51.839996 169 43264 0 400\n", "4 9.2 20 208 0 17 84.639999 400 43264 0 289\n", "5 19.0 11 254 2 11 361.000000 121 64516 4 121" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newdf = pd.DataFrame(np.hstack([mat, mat ** 2]))\n", "newdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://s-media-cache-ak0.pinimg.com/236x/bd/c5/e0/bdc5e05aaf6ec65bcd6ba935b47bf2a2.jpg)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }