Data from the GIPEyOP online election poll for the 2015 Spanish General election.

The general elections of 2015 in Spain took place in the middle of the Great Recession, after several years of economic austerity policies. This election caused a political earthquake that shook the Spanish party system. During the campaign, GIPEyOP (Elections and Public Opinion Research Group, University of Valencia) conducted a survey to collect relevant data about the electorate's beliefs, intentions and motivations. This article describes the resulting data set, which comprises 71 variables after removing, to ensure full anonymity, those variables that could potentially allow respondents to be identified. Respondents answered a self-administered online questionnaire and were recruited using chain sampling. A total of 14,261 valid observations were collected between 27th November and 18th December 2015. GIPEyOP used the data collected up to 14th December to deliver a prediction of the election outcomes during the campaign. Among other uses, this data set may be reused to assess theories of expectation formation, to trace how social networks spread geographically and to measure technological gaps by gender, age and education in the Spanish population.

Specifications Table
Subject: Social Sciences, Sociology, Political Science
Specific subject area: Social Sciences (general), Public opinion, Political Science
Type of data: csv file
How data were acquired: Data were obtained through a self-administered online questionnaire. LimeSurvey was used to conduct the survey. The questionnaire implemented in the online version is provided as supplementary material with the article (in Word format).
Data format: Raw
Parameters for data collection: A snowball or chain sampling method was used to recruit respondents.
Description of data collection: The survey was carried out on the occasion of the 2015 Spanish General Election. The survey data were collected over twenty days (between 27th November and 18th December 2015).
Data source location: Country: Spain
Data accessibility: The data file (comma-separated values format, csv file) is supplied as supplementary material with this article.

Value of the data
• This dataset comprises the second-largest publicly available sample of the 2015 Spanish General Election.
• Social scientists, including sociologists, political scientists and public opinion researchers, may benefit from these data.
• Theories of expectation formation and of diffusion of social events can be tested using this dataset.
• Although the dataset contains many standard public opinion variables, its 71 variables also include non-standard ones that make it unique; among them, respondents' beliefs and preferences and the dates and times of responses.
• This dataset is an example showing that valuable information can be extracted from non-random samples.
• Technological gaps by gender, age and education in the Spanish population may also be studied using these data.

Data Description
Data were obtained through a self-administered online questionnaire, which was implemented using LimeSurvey (an open source survey tool). The questionnaire is provided with the article as supplementary material. Table 1 describes the variables available in the dataset.
As shown in Table 1, the values of the variable PROV (section I) correspond to the Spanish provinces (see Table 2). In the questionnaire, the respondent had to select the province where she/he had the right to vote, not her/his province of residence.
Section III of the questionnaire asked two questions: (i) If the General Elections were held tomorrow, which political party or electoral alliance would you vote for? (variable VOTE.GEN), and (ii) When in doubt, what would be your second choice? (variable VOTE.GEN.2). These were conditional questions, since not all political parties were running in all provinces: depending on the province where the respondent had the right to vote, different political parties were shown as answer options. Table 3 shows the main political parties running in the 2015 Spanish General election with the identification codes included in the dataset.
Similarly, section VI asked three questions (see the questionnaire) about the political party that the respondent voted for in the 2014 European elections (variable EUR2014), in the 2011 General election (variable GEN2011), and in the last Regional elections (variable AUT). Table 4 shows the main political parties that were running in these elections with their corresponding identification codes in the dataset.
Table 1 (excerpt). Section IV, variables PORC.J1 to PORC.J15 and PORC.J99: "In your opinion, what will be the most likely distribution of votes (as a percentage) in your province in the next general election?" Values between 0 and 100. The sum of the percentages of votes for all political parties (see Table 3) [...]
Data were collected between 27th November and 18th December 2015. The dataset, which is provided with the article, contains a total of 14,261 valid observations of 71 variables (see Table 1). Table 5 shows the distribution of the sample sizes by province and Table 6 the distribution by Autonomous Community.
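As a first contact with the file, a reader may want to tabulate respondents by province and screen the expected vote-share variables. The following is a minimal sketch, not part of the original processing. It uses an invented four-row stand-in for the csv file; only the column names PROV and PORC.J* come from Table 1. The rule that the shares add up to 100 is an assumption (the corresponding Table 1 entry is incomplete in the text), as is the encoding of missing answers as blank cells.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature stand-in for the supplementary csv file. The real
# file has 14,261 rows and 71 columns; the values below are invented.
SAMPLE_CSV = """PROV,PORC.J1,PORC.J2,PORC.J99
46,30,50,20
46,25,45,30
28,40,70,10
3,,,
"""


def load_rows(handle):
    """Read the csv into a list of dicts, one dict per respondent."""
    return list(csv.DictReader(handle))


def count_by_province(rows):
    """Number of respondents for each value of PROV."""
    return Counter(row["PROV"] for row in rows)


def porc_columns(rows):
    """Names of the expected vote-share columns (PORC.J1, ..., PORC.J99)."""
    return [name for name in rows[0] if name.startswith("PORC.J")]


def check_expectations(rows, tolerance=0.5):
    """Label each respondent's PORC.* answers.

    'missing'      : at least one share is blank (assumed missing-value code)
    'out_of_range' : a share lies outside [0, 100]
    'sum_not_100'  : shares do not add up to 100 (assumed constraint)
    'ok'           : none of the above
    """
    cols = porc_columns(rows)
    labels = []
    for row in rows:
        raw = [row[c] for c in cols]
        if any(v is None or v.strip() == "" for v in raw):
            labels.append("missing")
            continue
        shares = [float(v) for v in raw]
        if any(s < 0 or s > 100 for s in shares):
            labels.append("out_of_range")
        elif abs(sum(shares) - 100) > tolerance:
            labels.append("sum_not_100")
        else:
            labels.append("ok")
    return labels


rows = load_rows(io.StringIO(SAMPLE_CSV))
province_counts = count_by_province(rows)
labels = check_expectations(rows)
```

To run the same checks on the real data, `load_rows` can be given a file handle opened with `open(path, newline="")`, where `path` points to the supplementary csv file; the actual file name and missing-value coding should be checked against Table 1 first.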

Experimental Design, Materials, and Methods
The Internet has been a real revolution that is opening up very interesting research possibilities for social scientists. It is therefore not surprising that we are witnessing the emergence of new initiatives, mainly from the academic world, which exploit the possibilities of the Internet and seek to demonstrate that quality predictions can also be generated from biased samples. These range from the use of responses collected from Xbox users [1] to mechanisms where the potential respondent population is not selected by the pollster, but rather the respondents self-select. Thus, during the campaign for the 2015 General Election in Spain, held on 20th December, the research group GIPEyOP (http://gipeyop.uv.es/) carried out an exercise of this nature: a self-administered online questionnaire was released and snowball (or chain-referral) sampling was used [2].

We launched the questionnaire from Valencia via email and social networks such as WhatsApp, Facebook, Twitter, etc. In our message we asked respondents to collaborate by distributing the questionnaire, in turn, among their acquaintances, friends and family. Each questionnaire received was subjected to an intense filtering process to select only those meeting minimum requirements of quality (internal consistency) and quantity of available information. Among other issues, (i) we checked that the responses were made from a Spanish IP address, and (ii) we compared the responses collected with two electronic versions of the questionnaire, in which we set different specifications about the number of attempts available, and we assessed the consistency of respondents considering variables such as leaders' assessment, ideology or vote intention. These actions led us to discard 4,544 responses. The validated dataset contains a total of 14,261 observations of 71 variables (see Table 1).

Data Quality
The data available cannot be considered a simple random sample, and it is difficult to consider them a representative sample. The collection method necessarily introduces coverage and self-selection bias into the sample. The theoretical non-representativeness of the sample, however, is not unique to our data. All electoral opinion samples suffer to a greater or lesser extent from the problem of representativeness, mainly due to the differential non-response rates that pollsters encounter during fieldwork [3]. This problem affects even the most respected pollsters, such as the Centro de Investigaciones Sociológicas (CIS), the most prestigious Spanish survey organization [4]. As a randomly selected example, consider the barometer conducted by CIS in October 2014: comparing the raw answers collected with the actual results, we observe that just 28% of the respondents claimed to have voted for the Popular Party (PP) in the 2011 Spanish General Election [5], when actually 45% of voters supported PP in that election. Similarly, the raw data in our dataset have different sources of bias, as can be observed in Table 7. In Table 7 we compare, for some variables, sample data aggregations with actual register data; some population subgroups are clearly overrepresented (such as people living in the Valencian region), whereas others are underrepresented (such as PP voters).

Table 7. Actual and dataset distributions for some regional and national level available registers.

This does not mean that no valuable information can be derived from the data available. As an example, during the election campaign, on 14th December 2015, the last day to release polls to the public according to Spanish electoral law, GIPEyOP delivered a prediction of the election outcomes, and its estimates were among the ten most accurate predictions published during that electoral campaign. In particular, it ranked sixth out of 28 poll-based published vote estimates of the 2015 General Election. GIPEyOP estimates were built after correcting the major deviations present in the collected data by constructing vote propensities using socio-demographic variables and reported recall votes. Specifically, the prediction methodology of the GIPEyOP survey was based on estimating (through multilevel models) the probability that each person has of voting for each party, based on her/his individual variables and the characteristics of the environment where she/he lived. As individual characteristics, the following variables (see Table 1) available from the questionnaire were considered: age, sex, level of studies and voting history of the surveyed person. As contextual characteristics, the model included the province of residence, the demographic structure of the province (the distribution of the population by municipality size and by age groups) and the Autonomous Community.
The example above shows that, by properly weighting the responses, the dataset described in this paper can be used to make accurate population inferences. For example, the interested reader may use the marginal distributions in Table 7 not only to assess the level of bias in our dataset, but also to calibrate the sample. Moreover, she/he may employ the accompanying Appendix file (Excel file supplied as supplementary material) to construct weights from the joint distributions. Likewise, in our view, when constructing individual-level models, the biases present in the dataset could be overcome simply by working conditionally, i.e., by including the biased features as explanatory variables in the model. This dataset could therefore be reused to assess theories of expectation formation [6], to trace how social networks spread geographically or to measure technological gaps by gender, age and education in the Spanish population.
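To illustrate what calibrating the sample to marginal distributions can look like, the sketch below implements raking (iterative proportional fitting) in plain Python. It is one possible calibration technique, not the multilevel procedure GIPEyOP used for its prediction. The ten respondents, the variable names `sex` and `recall`, and the target shares are invented for illustration; in practice the targets would be taken from register margins such as those in Table 7.

```python
def rake(rows, targets, max_iter=200, tol=1e-8):
    """Raking (iterative proportional fitting) to known marginal shares.

    rows    : list of dicts, one per respondent
    targets : {variable: {category: population share}}; shares sum to 1 and
              every category observed in the sample must have a target
    Returns a list of weights with mean 1.
    """
    n = len(rows)
    weights = [1.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total of each observed category of this variable.
            current = {}
            for row, w in zip(rows, weights):
                current[row[var]] = current.get(row[var], 0.0) + w
            # Factor that moves each category to its target share.
            factors = {cat: shares[cat] * total / current[cat] for cat in current}
            for i, row in enumerate(rows):
                f = factors[row[var]]
                max_change = max(max_change, abs(f - 1.0))
                weights[i] *= f
        if max_change < tol:  # all margins already (almost) on target
            break
    mean_w = sum(weights) / n
    return [w / mean_w for w in weights]


def weighted_share(rows, weights, var, cat):
    """Weighted share of respondents whose `var` equals `cat`."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == cat)
    return hit / sum(weights)


# Invented toy sample: women and PP recall voters are underrepresented.
sample = (
    [{"sex": "F", "recall": "PP"}] * 1
    + [{"sex": "F", "recall": "other"}] * 2
    + [{"sex": "M", "recall": "PP"}] * 1
    + [{"sex": "M", "recall": "other"}] * 6
)
# Illustrative target margins (assumed, not taken from Table 7).
targets = {
    "sex": {"F": 0.5, "M": 0.5},
    "recall": {"PP": 0.45, "other": 0.55},
}
weights = rake(sample, targets)
```

After raking, `weighted_share(sample, weights, "recall", "PP")` matches the 0.45 target and the weighted share of women matches 0.5, up to numerical tolerance. Raking only needs the margins; when joint distributions are available, as in the Appendix file, cell-by-cell post-stratification weights are an alternative.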

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.