{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Подключаем библиотеки"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipy as sp\n",
    "import pandas as pd\n",
    "pd.set_option(\"display.max_columns\", 1000)\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Конкурс доступен по [ссылке](https://www.kaggle.com/c/quora-question-pairs). Наша задача - по паре вопросов предсказать вероятность того, что они об одном и том же (дубликаты)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Считываем [данные](https://www.kaggle.com/c/quora-question-pairs/data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train = pd.read_csv('./input/train.csv')\n",
    "test = pd.read_csv('./input/test.csv')\n",
    "\n",
    "train.fillna('', inplace=True)\n",
    "test.fillna('', inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В данных нас интересуют колонки \"question1\" и \"question2\" - сами вопросы. В train выборке есть целевая колонка \"is_duplicate\", кроме того, у каждого вопроса есть свой номер - \"qid1\" и \"qid2\" соответственно. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>qid1</th>\n",
       "      <th>qid2</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "      <th>is_duplicate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>What is the step by step guide to invest in share market in india?</td>\n",
       "      <td>What is the step by step guide to invest in share market?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>What is the story of Kohinoor (Koh-i-Noor) Diamond?</td>\n",
       "      <td>What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>How can I increase the speed of my internet connection while using a VPN?</td>\n",
       "      <td>How can Internet speed be increased by hacking through DNS?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>Why am I mentally very lonely? How can I solve it?</td>\n",
       "      <td>Find the remainder when [math]23^{24}[/math] is divided by 24,23?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?</td>\n",
       "      <td>Which fish would survive in salt water?</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id  qid1  qid2  \\\n",
       "0  0   1     2      \n",
       "1  1   3     4      \n",
       "2  2   5     6      \n",
       "3  3   7     8      \n",
       "4  4   9     10     \n",
       "\n",
       "                                                                      question1  \\\n",
       "0  What is the step by step guide to invest in share market in india?             \n",
       "1  What is the story of Kohinoor (Koh-i-Noor) Diamond?                            \n",
       "2  How can I increase the speed of my internet connection while using a VPN?      \n",
       "3  Why am I mentally very lonely? How can I solve it?                             \n",
       "4  Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?   \n",
       "\n",
       "                                                                                  question2  \\\n",
       "0  What is the step by step guide to invest in share market?                                  \n",
       "1  What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?   \n",
       "2  How can Internet speed be increased by hacking through DNS?                                \n",
       "3  Find the remainder when [math]23^{24}[/math] is divided by 24,23?                          \n",
       "4  Which fish would survive in salt water?                                                    \n",
       "\n",
       "   is_duplicate  \n",
       "0  0             \n",
       "1  0             \n",
       "2  0             \n",
       "3  0             \n",
       "4  0             "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_id</th>\n",
       "      <th>question1</th>\n",
       "      <th>question2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>How does the Surface Pro himself 4 compare with iPad Pro?</td>\n",
       "      <td>Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Should I have a hair transplant at age 24? How much would it cost?</td>\n",
       "      <td>How much cost does hair transplant require?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>What but is the best way to send money from China to the US?</td>\n",
       "      <td>What you send money to China?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Which food not emulsifiers?</td>\n",
       "      <td>What foods fibre?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>How \"aberystwyth\" start reading?</td>\n",
       "      <td>How their can I start reading?</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   test_id  \\\n",
       "0  0         \n",
       "1  1         \n",
       "2  2         \n",
       "3  3         \n",
       "4  4         \n",
       "\n",
       "                                                            question1  \\\n",
       "0  How does the Surface Pro himself 4 compare with iPad Pro?            \n",
       "1  Should I have a hair transplant at age 24? How much would it cost?   \n",
       "2  What but is the best way to send money from China to the US?         \n",
       "3  Which food not emulsifiers?                                          \n",
       "4  How \"aberystwyth\" start reading?                                     \n",
       "\n",
       "                                                              question2  \n",
       "0  Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?  \n",
       "1  How much cost does hair transplant require?                           \n",
       "2  What you send money to China?                                         \n",
       "3  What foods fibre?                                                     \n",
       "4  How their can I start reading?                                        "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В качестве признаков возьмем частоты слов в вопросах"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "prep = CountVectorizer().fit(list(train['question1']) + list(train['question2']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train = sp.sparse.hstack([prep.transform(train['question1']),\n",
    "                            prep.transform(train['question2'])])\n",
    "\n",
    "Y_train = train['is_duplicate'].as_matrix()\n",
    "\n",
    "X_test = sp.sparse.hstack([prep.transform(test['question1']),\n",
    "                           prep.transform(test['question2'])])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Обучаем [логистическую регрессию](http://bit.ly/2pTNQOj) с L1-[регуляризацией](http://bit.ly/1FeZ3d9)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
       "          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,\n",
       "          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = LogisticRegression(penalty='l1')\n",
    "clf.fit(X_train, Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Предсказываем нужные вероятности"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "preds = clf.predict_proba(X_test)[:, 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Записываем в файл"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "submission = pd.read_csv('./input/sample_submission.csv')\n",
    "submission['is_duplicate'] = preds\n",
    "submission.to_csv('submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Файл с предсказанием нужно загрузить по [ссылке](https://www.kaggle.com/c/quora-question-pairs/submit). Ваше решение будет оцениваться по метрике [Logloss](https://www.kaggle.com/c/quora-question-pairs#evaluation)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}