Principal Component Analysis: A Tool for Identifying Web Document Characteristics Affecting Quality of Drug Information Websites

Article history: Received on: 29/08/2017 Accepted on: 14/10/2017 Available online: 30/11/2017

The objective of this study was to identify the Web document characteristics that affect the quality of drug information websites using principal component analysis (PCA), in order to help consumers, patients, and Web developers recognize the design aspects associated with information quality. Websites were collected using 8 search terms and the 3 most widely used search engines in Thailand. Thirty-five drug information websites were assessed by two independent raters for the quality of the drug information provided, using the DISCERN criteria. Sixteen Web document characteristics were investigated. PCA was applied to the data, and the principal components (PCs) were plotted to detect the characteristics most strongly related to the quality of drug information websites. Six PCs accounted for 73.39% of the total variance. The overall mean DISCERN score for quality was “fair” at 52.5 (range, 21–72; SD = 11.1). Four attributes were identified as the factors that most influence the quality status of drug information websites. These findings can help consumers and patients judge the quality of sites that provide drug information and can support Web authors in improving the quality of drug information websites.


INTRODUCTION
Nowadays, searching the Internet has become a common way for people to learn about their well-being and medical problems (Jansen and Spink, 2006). The number of websites offering health-related information, including drug information websites, increases quickly every day (NCCAM, 2006). In the past, seventy-two percent of online users searched for health and medical information of one kind or another, such as information on serious drug- or disease-related conditions, general information, or minor health problems (Fox and Duggan, 2013). Moreover, seventy-seven percent of online health seekers started with a search engine such as Google, Yahoo, or Bing (Fox and Duggan, 2013). Patients use the World Wide Web for various health-related reasons. Fifty-eight percent reported using the World Wide Web to review side effects of drugs or complications of medical therapy (Joseph et al., 2002). Good drug information can help prevent medication errors and lead to enhanced quality of patient care.
In the real world, websites can be developed by anyone. Thus, many websites provide valuable information, but others may provide questionable or misleading information. Consumers ought to treat online drug information with caution, since nobody has vetted the tremendous amount of drug information on the Web to guarantee its quality and accuracy. This is the motivation for instruments that assess the quality of websites. Several such instruments exist, and four are widely used today: the DISCERN tool, the HON Code, CyberGuide and the JAMA benchmarks. In this study, we were interested in examining the quality of the drug information accessible to patients. Hence, we selected a simple and common instrument, DISCERN. DISCERN is a questionnaire divided into 3 parts: 1) reliability of the publication, 2) quality of information about treatment choices, and 3) the overall quality rating (Charnock D, 1998). DISCERN is easy to use and can be used not only by patients but also by pharmacists or authors of health information as a standard guide to what users are entitled to expect.
A variety of reports have characterized the manner in which consumers search for health-related information (Zhang and Dimitroff, 2005; Zhang and Dimitroff, 2005) and have assessed the quality of health-related information on websites (Woodruff, 1996; Jadad, 2006; McCool et al., 2015; Memon et al., 2016). However, no research has examined the Web document characteristics of drug information websites in relation to the quality of the drug information they provide. The purpose of this study is to apply principal component analysis (PCA) to identify the Web document characteristics that should be considered when designing drug information websites in order to achieve information quality. PCA is a multivariate statistical method based on eigenvector decomposition. It was first presented by Pearson (1901) and developed independently by Hotelling (1933) (Jolliffe, 2002). PCA consolidates a large number of interrelated variables into a smaller number of principal components (PCs) (Sratthaphut et al., 2013). Those PCs are then visualized, while retaining as much as possible of the variation present in the data set.

Sites identification and evaluation
Internet sites were identified using two general search terms ('drug information' and 'medical information') and six specific search terms ('amoxicillin', 'celecoxib', 'hydrochlorothiazide', 'lipitor', 'prozac' and 'spironolactone') in the three most widely used search engines in Thailand (Google, Yahoo!, and Bing, accessed in January 2016). When the keywords were entered into these search engines, the World Wide Web was searched for websites related to them. A list of uniform resource locators (URLs) was displayed in the search engine result pages (SERPs), arranged in decreasing order of relevance to the search keywords. Because drug information websites with good-quality information may not rank highly on SERPs, the top 30 consecutive URLs and the URLs ranked 301st to 330th in the SERPs were collected for each search. In total, one thousand four hundred and forty URLs (60x3x8) were returned by the eight keyword searches conducted in each of the three search engines. Sites that met any of the following criteria were discarded:
- Non-English language sites
- Sites not providing drug information
- Sites with illegal content or design
- Sites for advertising purposes
- Duplicate sites
- Book review sites or sites offering journal abstracts
- Non-operative sites or sites requiring registration
The remaining thirty-five drug information websites were included. Each site was then independently evaluated by two raters (registered pharmacists) according to the DISCERN criteria. The DISCERN instrument comprises 16 key questions, each rated on a five-point Likert scale, and provides users with a reliable framework for assessing the quality of health information. The raw scores are summed at the end of the assessment, so the maximum possible score is 80. The raw scores obtained were converted to percentage scores. The class interval was calculated by the formula (Smax-Smin)/3, where Smax is the overall maximum score (72) and Smin is the overall minimum score (21).
The class interval was used to define three quality levels (groups). All sites were classified into three groups according to their percentage scores, interpreted as follows: >56% "good", 39-56% "fair" and <39% "poor".
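The grouping rule above can be illustrated with a short Python sketch. The function names are hypothetical and are not part of the study; the cut-offs are the ones stated in the text, and the sketch only reproduces the arithmetic of the class interval and the three quality levels.

```python
# Illustrative sketch only: function names are hypothetical, not from the study.
S_MAX = 72  # overall maximum score reported in the text
S_MIN = 21  # overall minimum score reported in the text


def class_interval(s_max, s_min, n_groups=3):
    """Width of each quality class: (Smax - Smin) / number of groups."""
    return (s_max - s_min) / n_groups


def quality_level(percentage_score):
    """Map a percentage score to the quality levels stated in the text."""
    if percentage_score > 56:
        return "good"
    if percentage_score >= 39:
        return "fair"
    return "poor"


interval = class_interval(S_MAX, S_MIN)   # (72 - 21) / 3
mean_level = quality_level(52.5)          # level of the reported overall mean
```

With the reported extremes, the class interval is 17 points, and the reported overall mean of 52.5 falls in the "fair" band.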

Data collection
Twenty Web documents (also referred to as Web pages) from each of the 35 evaluated drug information websites were randomly downloaded in March 2016 and stored for analysis. Hence, the data set used in this study comprised 700 Web documents (35x20). The observed Web document characteristics were document size, download time, image size, CSS size, the number of broken links, the number of errors in HTML tags, and the numbers of the HTML tags <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <link>, <meta>, <p> and <img>. The values of these variables were obtained using free tools. Document size, download time, image size, and CSS size were collected with Web Page Analyzer 0.98 (http://www.websiteoptimization.com/services/analyze/). The number of broken links was calculated with the W3C Link Checker (http://validator.w3.org/checklink).
Errors in the HTML tags of the pages were found with the W3C Markup Validation Service (http://validator.w3.org/). Finally, the tag counts were extracted using HTML Tag Count (http://redwriteblue.com/tags/htmlcount.html).
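As an illustration of how the tag counts described above could be obtained, the following Python sketch counts the ten tags of interest in a single page using the standard-library `html.parser` module. This is not the HTML Tag Count tool used in the study; the class name, function name, and sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# The ten tags observed in the study.
TAGS_OF_INTEREST = ("h1", "h2", "h3", "h4", "h5", "h6", "link", "meta", "p", "img")


class TagCounter(HTMLParser):
    """Counts opening tags of interest (hypothetical helper, not the study's tool)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names; self-closing tags also arrive here.
        if tag in TAGS_OF_INTEREST:
            self.counts[tag] += 1


def count_tags(html_text):
    """Return a dict with the number of occurrences of each tag of interest."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return {tag: parser.counts[tag] for tag in TAGS_OF_INTEREST}


# Hypothetical sample page, for demonstration only.
sample_page = (
    '<html><head><meta charset="utf-8"><link rel="stylesheet" href="s.css"></head>'
    "<body><h1>Amoxicillin</h1><h3>Dosage</h3><p>Text</p><p>More</p>"
    '<img src="a.png"/></body></html>'
)
sample_counts = count_tags(sample_page)
```

Averaging such per-page counts over the 20 pages of a site would give one value per tag per site, which is the form of the variables analyzed in this study.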

Data analysis
Cohen's kappa statistic was applied to test inter-rater agreement, and the kappa values were interpreted following Fleiss: 0.0 to 0.40 poor, 0.41 to 0.75 fair to good, and >0.75 excellent (Fleiss et al., 2003; McCool et al., 2015). All data were processed on a notebook computer equipped with an Intel® Core™ i3 processor, 2 GB of RAM, and Windows 7. PCA and all other statistical analyses were performed with SciCraft open source data analysis software 1.0.2 (Alsberg et al., 2004) and PSPP 0.9.0 open source statistical software (Pfaff and Darrington, 2015), respectively.
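For readers unfamiliar with the statistic, the following Python sketch computes Cohen's kappa for two raters and applies the interpretation bands given above. It is a minimal illustration, not the PSPP procedure used in the study. The ratings are hypothetical, and treating 0.40 and 0.75 as the cut points between bands is an assumption about how boundary values are assigned.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Bands from the text; the exact cut points (0.40, 0.75) are an assumption."""
    if kappa > 0.75:
        return "excellent"
    if kappa > 0.40:
        return "fair to good"
    return "poor"


# Hypothetical ratings of six items by two raters.
ratings_a = ["good", "good", "fair", "poor", "fair", "good"]
ratings_b = ["good", "fair", "fair", "poor", "fair", "good"]
kappa = cohens_kappa(ratings_a, ratings_b)
agreement = interpret_kappa(kappa)
```

In this hypothetical example the raters agree on five of six items, and chance agreement is 13/36, so kappa is 17/23 (about 0.74), which falls in the "fair to good" band.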
PCA was applied to the data to detect the factors most important in describing the quality of drug information websites. Six PCs with eigenvalues greater than one, a general statistical cut-off level (Hamilton, 2010), were selected. These six PCs accounted for 73.39% of the total variance. The number of variables was therefore reduced from 16 original variables to 6 uncorrelated PCs, with a 26.61% loss of variation. The first principal component (PC1) was the linear combination that best summarized the variation in the original data matrix (22.53% of the variation explained), while the others (PC2-PC6) accounted for the rest of the retained variance (50.86%). In the score plot (Figure 2), websites were grouped according to their quality status. Most of the good-quality sites were clustered in the top-right of Figure 2. This indicates that PC1 and PC2 can be considered representative of quality status.
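The component selection described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the SciCraft workflow used in the study: the data are randomly generated stand-ins for a 35-site by 16-variable matrix, the function name is hypothetical, and PCA is assumed to be performed on standardized variables (that is, on the correlation matrix), which is the setting in which the eigenvalue-greater-than-one cut-off is normally applied.

```python
import numpy as np


def pca_kaiser(data):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    x = np.asarray(data, dtype=float)
    # Standardize each variable (column) to zero mean and unit variance.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    # eigh is appropriate for symmetric matrices; it returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()   # proportion of variance per PC
    keep = eigenvalues > 1.0                      # eigenvalue-greater-than-one rule
    loadings = eigenvectors[:, keep]              # variable weights on retained PCs
    scores = z @ loadings                         # site coordinates for a score plot
    return eigenvalues, explained, loadings, scores


# Hypothetical data: 35 sites x 16 Web document characteristics.
rng = np.random.default_rng(0)
latent = rng.normal(size=(35, 3))
mixing = rng.normal(size=(3, 16))
data = latent @ mixing + rng.normal(scale=0.5, size=(35, 16))

eigenvalues, explained, loadings, scores = pca_kaiser(data)
n_retained = loadings.shape[1]
retained_variance = explained[:n_retained].sum()
```

The first two columns of `scores` and `loadings` correspond to the quantities shown in a score plot and a loading plot, respectively; `retained_variance` is the analogue of the 73.39% figure reported above, although its value here depends entirely on the synthetic data.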
In the loading plot (Figure 3), the Web document variables are represented as a function of both PC1 and PC2. PC2 has positive loadings for variables 3 (the average number of <h3> tags), 4 (the average number of <h4> tags), 5 (the average number of <h5> tags), 6 (the average number of <h6> tags), 7 (the average number of <link> tags), 8 (the average number of <meta> tags), 9 (the average number of <p> tags), 12 (the average percentage of image size per page size), 14 (the average number of <img> tags), and 16 (the average number of HTML errors per page), and negative loadings for variables 1 (the average number of <h1> tags), 2 (the average number of <h2> tags), 10 (the average download time), 11 (the average page size), 13 (the average percentage of CSS size per page size) and 15 (the average percentage of dead links per total links). Moreover, variables 3, 4, 5 and 9 have high positive values on both PC1 and PC2. As seen from Figure 2 and Figure 3, the variables appearing in the top-right of Figure 3 fall in the same top-right quadrant as the good-quality sites in Figure 2. This indicates that the numbers of <h3>, <h4>, <h5> and <p> tags were correlated with higher quality of drug information websites.
Thus, these variables are the potential factors responsible for the quality of drug information websites. In general, the <h1> tag is used for the most important heading on the page, followed by <h2>, <h3> and so on.
Web authors use heading tags to separate topics according to the importance of the information. Additionally, heading tags are often followed by a short paragraph, which is marked up with the <p> tag. Headings and paragraph structure make content more understandable. For this reason, good-quality drug information websites have high average numbers of <h3>, <h4>, <h5> and <p> tags. These PCA results support the view that good-quality drug information sites enrich their pages with information to make them more understandable.

CONCLUSIONS
This paper investigated various Web document characteristics related to the quality of the drug information provided by websites. PCA is a powerful technique for determining the factors most significant to quality. The website parameters found to most influence quality status were the numbers of <h3>, <h4>, <h5>, and <p> tags. The explanation for these results is that the more paragraphs a site adds, the higher its quality tends to be. This conclusion allows health information consumers and patients to make preliminary decisions about whether to rely on drug information websites. In addition, it is useful for Web developers, highlighting the attributes to refine in order to achieve information quality. This work can be extended to investigate whether these characteristics are also correlated with quality as assessed by the HON Code, CyberGuide and the JAMA benchmarks.

Fig. 1: Summary of the average number of unique tags.