{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Mining. Introduction to Data Analysis\n", "\n", "## Seminar 3. The NumPy library\n", "\n", "### 0. Useful information\n", " - [Course page on the wiki](http://wiki.cs.hse.ru/Майнор_Интеллектуальный_анализ_данных/Введение_в_анализ_данных)\n", " - [Seminar page on the wiki](http://wiki.cs.hse.ru/Майнор_Интеллектуальный_анализ_данных/Введение_в_анализ_данных/ИАД-11,12)\n", " - [Grades table](https://docs.google.com/spreadsheets/d/1jZL_-ELf0Ogj2XHa6VVbkg8vrInycv2-Z9UR5keLDfM/edit?usp=sharing)\n", " - Course email: *hse.minor.dm@gmail.com* (subject format: \"[ИАД-NN] - Вопрос - Фамилия Имя Отчество\", that is, group, question, full name)\n", " - Virtual machine for the minor (see the seminar page for details)\n", " - **Subscribe to the mailing list**: send an empty email to hse-minor-datamining-2+subscribe@googlegroups.com\n", " - The first homework assignment!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Recap of the previous seminar\n", "\n", "**Types of targets**\n", " - real-valued targets (regression)\n", " - a finite set of targets (classification: binary / multiclass / overlapping classes)\n", " - time series\n", " - ranking\n", " - no target (clustering)\n", " \n", "**Types of features**\n", " - binary ({0, 1})\n", " - real-valued\n", " - categorical (unordered finite set)\n", " - ordinal (ordered finite set)\n", " - set-valued features\n", " \n", "**Generalization ability**\n", " - underfitting\n", " - overfitting\n", " \n", "**Data analysis tasks**\n", " - medical diagnosis: the object is a patient, the target is a diagnosis, the task is classification with overlapping classes; features: binary (sex), ordinal (severity of condition), real-valued (weight)\n", " - credit scoring\n", " - customer churn prediction\n", " - real estate price estimation\n", " - sales forecasting (time series)\n", " - recommender systems" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## 2. Numpy \n", "### Полезная информация\n", " - [Все numpy функции](http://docs.scipy.org/doc/numpy-1.10.0/reference/)\n", " - [100 заданий по Numpy](http://www.labri.fr/perso/nrougier/teaching/numpy.100/)\n", " - Попробуйте в ipython notebook:\n", " - **'np.ar' + [Tab]** (autocompletion)\n", " - **'np.arange()' + [Shift+Tab]** (docstring)\n", " - **?np.arange** (object description)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Создание веторов:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4,)\n", "(1, 4)\n", "(3, 2, 2)\n" ] } ], "source": [ "X = np.array([1, 2, 3, 4])\n", "print X.shape\n", "\n", "X = np.array([[1, 2, 3, 4]])\n", "print X.shape\n", "\n", "X = np.array([ [[1, 2],\n", " [3, 4]], \n", " [[1, 2],\n", " [3, 4]],\n", " [[1, 2],\n", " [3, 4]] ])\n", "print X.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = np.arange(15)\n", "A" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 4],\n", " [ 5, 6, 7, 8, 9],\n", " [10, 11, 12, 13, 14]])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = np.arange(15).reshape(3, 5)\n", "A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Пустой массив, массив без инициализации" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0,)" ] }, "execution_count": 26, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "X = np.array([])\n", "X.shape" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "ename": "ValueError", "evalue": "total size of new array must be unchanged", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mX\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m5\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mValueError\u001b[0m: total size of new array must be unchanged" ] } ], "source": [ "X = np.array([]).reshape(2, 2)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 4.94065646e-324, 9.88131292e-324],\n", " [ 1.48219694e-323, 1.97626258e-323]])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = np.empty((2, 2))\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Что еще: \n", "- *np.ndarray.dtype*, *np.astype()* [data type conversion]\n", "- *np.asarray()*, *np.ndarray.tolist()* [object type conversion]\n", "- **Syntax{np.arange()}**: *np.linspace()*, *np.logspace()*; **Syntax{np.empty()}**: *np.ones()*, *np.eye()*; *np.diag()* [creation]\n", "- *np.empty_like()*, *np.zeros_like()*, *np.ones_like()*, ***np.copy()*** [create_like, **deep copy** *vs* **view**]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Изменение размерностей" ] }, { 
"cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 5, 10],\n", " [ 1, 6, 11],\n", " [ 2, 7, 12],\n", " [ 3, 8, 13],\n", " [ 4, 9, 14]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.T" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 2],\n", " [3, 4, 5]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B = np.arange(6).reshape(2, 3)\n", "B" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Конкатенация:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 5, 10],\n", " [ 1, 6, 11],\n", " [ 2, 7, 12],\n", " [ 3, 8, 13],\n", " [ 4, 9, 14],\n", " [ 0, 1, 2],\n", " [ 3, 4, 5]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.concatenate((A.T, B), axis=0)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 4, 0, 3],\n", " [ 5, 6, 7, 8, 9, 1, 4],\n", " [10, 11, 12, 13, 14, 2, 5]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.concatenate((A, B.T), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В чем разница между *concatenate* и *vstack*/*hstack*? 
[Stack Overflow!](http://stackoverflow.com/questions/33356442/when-should-i-use-hstack-vstack-vs-append-vs-concatenate-vs-column-stack)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Changing the shape of a matrix (the total number of elements must stay the same)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 2],\n", " [3, 4, 5]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(6).reshape(2, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If one of the dimensions is set to -1, it is computed automatically from the remaining ones" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 2],\n", " [3, 4, 5]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(6).reshape(2, -1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Flattening into a vector" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.ravel(A)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0, 5, 10, 1, 6, 11, 2, 7, 12, 3, 8, 13, 4, 9, 14])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.ravel(A, order='F')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "?np.ravel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[flatten vs ravel](http://stackoverflow.com/questions/28930465/what-is-the-difference-between-flatten-and-ravel-functions-in-numpy)" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What else: \n", "- *np.swapaxes()*, *np.transpose()* [reshape]\n", "- *np.newaxis* [broadcasting], *np.split()*, *np.tile()*, *np.repeat()* [broadcast]\n", "- **Syntax{np.concatenate()}**: *np.vstack()*, *np.hstack()*; *np.append()* [concatenate]\n", "- *np.insert()*, *np.delete()*, *np.resize()*\n", "- *np.unique()*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting elements" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 2, 3, 4],\n", " [5, 6, 7, 8, 9]])" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = np.arange(10).reshape(2, 5)\n", "A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying a single condition is straightforward" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([5, 6, 7, 8, 9])" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A[A > 4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if several conditions are needed?" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "ename": "ValueError", "evalue": "The truth value of an array with more than one element is ambiguous. 
Use a.any() or a.all()", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mA\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mA\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m4\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mA\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0;36m6\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()" ] } ], "source": [ "A[A > 4 and A < 6]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([5])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A[np.logical_and(A > 4, A < 6)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find all nonzero elements of a matrix:" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(array([0, 0, 0, 0, 1, 1, 1, 1, 1]), array([1, 2, 3, 4, 0, 1, 2, 3, 4]))" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.nonzero()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A[A.nonzero()]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 10, 10, 10, 10],\n", " [10, 10, 10, 10, 10]])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"A[A.nonzero()] = 10\n", "A" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "A = np.arange(10).reshape(2, 5)\n", "B = np.arange(9, -1, -1).reshape(2, 5) # на заметку slicing через аргументы: start = 9, stop=-1, step=-1" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[9, 8, 7, 6, 5],\n", " [4, 3, 2, 1, 0]])" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[9, 8, 7, 6, 5],\n", " [5, 6, 7, 8, 9]])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.where(A>B, A, B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Использование для индексации" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ind = np.where(A>5)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([6, 7, 8, 9])" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A[ind[0], ind[1]]" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 4],\n", " [ 5, -3, -3, -3, -3]])" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A[ind[0], ind[1]] = -3\n", "A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Что еще: \n", "- **Syntax{np.logical_and()}**: *np.logical_and()*, *np.logical_not()* [&&, ||, ~]\n", "- *np.all*, *np.any()* [bool array check]\n", "- *np.isnan()*, *np.isinf()* [NaNs, Infs check]\n", "- *np.allclose()*, *np.isclose()*, *np.equal()* **[float comparison]**" ] }, 
{ "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Matrix\n", "NumPy has one more data type: matrix. It is a subclass of ndarray, so all methods and functions that apply to array can be used with it. However:\n", " - a matrix is strictly 2-dimensional;\n", " - matrix multiplication is written with `*` (for array, `*` is elementwise and the matrix product is computed with dot)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "A = np.arange(0, 4).reshape(2, 2)\n", "B = np.arange(3, 7).reshape(2, 2)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 4],\n", " [10, 18]])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A * B" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = np.matrix(A)\n", "b = np.matrix(B)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "matrix([[ 5, 6],\n", " [21, 26]])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a * b" ] }, { "cell_type": "raw", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Purpose\n", "\n", "**Why** do we need NumPy when we already have nested lists/tuples and loops?\n", "\n", "The reason is speed:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: there is no need to use the **random** module; everything is available in **numpy.random**" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "A_quick_arr = np.random.normal(size = (100000,))\n", "B_quick_arr = np.random.normal(size = (100000,))\n", "\n", "A_slow_list, B_slow_list = list(A_quick_arr), B_quick_arr.tolist()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": true }, "outputs": [], "source": [ "N_repeat = 100" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0429953\n" ] } ], "source": [ "# dot product with an explicit Python loop\n", "start = time.time()  # time.time() works in both Python 2 and 3\n", "for _ in range(N_repeat):\n", "    ans = 0\n", "    for i in range(100000):\n", "        ans += A_slow_list[i] * B_slow_list[i]\n", "print((time.time() - start) / N_repeat)  # average time per run, in seconds" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.03926768\n" ] } ], "source": [ "# dot product with a list comprehension\n", "start = time.time()\n", "for _ in range(N_repeat):\n", "    ans = sum([A_slow_list[i] * B_slow_list[i] for i in range(100000)])\n", "print((time.time() - start) / N_repeat)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0005213\n" ] } ], "source": [ "# dot product with vectorized NumPy operations\n", "start = time.time()\n", "for _ in range(N_repeat):\n", "    ans = np.sum(A_quick_arr * B_quick_arr)\n", "print((time.time() - start) / N_repeat)" ] }, { "cell_type": "raw", "metadata": {}, "source": [] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. To review at home:\n", "- All the \"What else\" lists above\n", "- The **np.random** and **time** modules (**%timeit** is also worth knowing)\n", "- **Indexing** ([basic](http://docs.scipy.org/doc/numpy-1.10.1/user/basics.indexing.html), [advanced](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html))\n", "- Statistical functions: *amin()*, *amax()*, *mean()*, *std()*\n", "- Math functions: *sum()*, *prod()*, *cumsum()*, *cumprod()*, *diff()*\n", "- Can it be shorter? Yes: *a.min()* and *np.min(a)*\n", "- *nansum()*, *nanmin()*, *nanmax()*, *nanmean()*\n", "- **Read the docstrings**: the See Also and Notes sections\n", "- **[Numpy Performance tricks](http://nbviewer.jupyter.org/gist/rossant/4645217)** **(!)**" ] }, { "cell_type": "raw", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }