Cleaning uncertain graphs via noisy crowdsourcing

Wu, Yongcheng; Lin, Xin; Yang, Yan; He, Liang

doi:10.1007/s11280-018-0624-8

Published: 31 July 2018

World Wide Web, Volume 22, pages 1523–1553 (2019)

Yongcheng Wu¹, Xin Lin¹, Yan Yang¹ (ORCID: orcid.org/0000-0001-9922-2508) & Liang He¹

Abstract

Uncertain graphs are an important data model for many real-world applications. In these graphs, each edge is associated with an existential probability that represents the likelihood that the edge exists, and queries are answered over these probabilities. Almost all work in this area focuses on improving the efficiency of query processing. However, another issue deserves attention: query results on uncertain graphs are sometimes uninformative because of edge uncertainty. We adopt a crowdsourcing-based approach to make the query results more informative. To save the monetary and time cost of crowdsourcing, we select the optimal edges to clean so as to maximize the quality improvement. The noise in the crowdsourcing results, however, makes the problem more complex. We prove that the problem is #P-hard and propose an efficient algorithm to derive the optimal edge. Our experimental results show that the proposed algorithm outperforms random selection by up to 22 times in quality improvement and is up to 5 times faster than each-edge comparison in elapsed time, which shows that the algorithm is both effective and efficient.

References

1. Aggarwal, C.C.: Managing and mining uncertain data. Springer, US (2009)
2. Ball, M.O.: Computational complexity of network reliability analysis: an overview. IEEE Trans. Reliab. 35(3), 230–239 (1986)
3. Brabham, D.C.: Crowdsourcing as a model for problem solving: an introduction and cases. Convergence: The International Journal of Research into New Media Technologies 14(1), 75–90 (2008)
4. Chen, M., Gu, Y., Bao, Y., Yu, G.: Label and distance-constraint reachability queries in uncertain graphs. In: Database Systems for Advanced Applications, pp. 188–202. Springer International Publishing, Cham (2014)
5. Cheng, J., Huang, S., Wu, H., Fu, W.C.: TF-Label: a topological-folding labeling scheme for reachability querying in a large graph. In: ACM SIGMOD International Conference on Management of Data, pp. 193–204 (2013)
6. Cheng, R.: Querying and cleaning uncertain data. Springer, Berlin (2009)
7. Cheng, R., Chen, J., Xie, X.: Cleaning uncertain data with quality guarantees. Proceedings of the VLDB Endowment 1(1), 722–735 (2008)
8. Doan, A.H., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide Web. Commun. ACM 54(4), 86–96 (2011)
9. Fishman, G.S.: A comparison of four Monte Carlo methods for estimating the probability of s-t connectedness. IEEE Trans. Reliab. 35(2), 145–155 (1986)
10. Jin, R., Hong, H., Wang, H., Ning, R., Xiang, Y.: Computing label-constraint reachability in graph databases. In: ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June, pp. 123–134 (2010)
11. Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. Very Large Data Bases 4(9), 551–562 (2011)
12. Jin, R., Liu, L., Ding, B., Wang, H.: Distance-constraint reachability computation in uncertain graphs. Proceedings of the VLDB Endowment 4(9), 551–562 (2011)
13. Karp, R.M., Luby, M.G.: A new Monte-Carlo method for estimating the failure probability of an (1983)
14. Khan, A., Chen, L.: On uncertain graphs modeling and queries. VLDB Endowment (2015)
15. Krogan, N.J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta, N., Tikuisis, A.P.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–43 (2006)
16. Lin, X., Xu, J., Hu, H.: Range-based skyline queries in mobile environments. IEEE Trans. Knowl. Data Eng. 25(4), 835–849 (2013)
17. Lin, X., Peng, Y., Choi, B., Xu, J.: Human-powered data cleaning for probabilistic reachability queries on uncertain graphs. IEEE Trans. Knowl. Data Eng. 29(7), 1452–1465 (2017)
18. Marcus, A., Wu, E., Karger, D., Madden, S., Miller, R.: Human-powered sorts and joins. Proceedings of the VLDB Endowment 5(1), 13–24 (2011)
19. Mo, L., Cheng, R., Li, X., Cheung, D.W.: Cleaning uncertain data for top-k queries. In: IEEE International Conference on Data Engineering, pp. 134–145 (2013)
20. Niedermayer, J., Emrich, T., Renz, M., Mamoulis, N., Chen, L., Kriegel, H.P.: Probabilistic nearest neighbor queries on uncertain moving object trajectories. Proceedings of the VLDB Endowment 7(3), 205–216 (2013)
21. Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive skyline computation in database systems. ACM Trans. Database Syst. 30(1), 41–82 (2005)
22. Ruomingjin Linliu, B.H.: Distance constraint reachability computation in. PVLDB 4(9), 2011 (2012)
23. Solecki, B., Solecki, B., Solecki, B.: KDD Cup 2013 - author-paper identification challenge: second place team. In: KDD Cup 2013 Workshop, pp. 3 (2013)
24. Soliman, M.A., Ilyas, I.F., Chang, C.C.: Top-k query processing in uncertain databases. In: IEEE International Conference on Data Engineering, pp. 896–905 (2007)
25. Tao, Y., Xiao, X., Pei, J.: Efficient skyline and top-k retrieval in subspaces. IEEE Trans. Knowl. Data Eng. 19(8), 1072–1088 (2007)
26. Tong, Y., Chen, L., Cheng, Y., Yu, P.S.: Mining frequent itemsets over uncertain databases. Proceedings of the VLDB Endowment 5(11), 1650–1661 (2012)
27. Tong, Y., Chen, L., Ding, B.: Discovering threshold-based frequent closed itemsets over probabilistic data. In: IEEE International Conference on Data Engineering, pp. 270–281 (2012)
28. Tong, Y., Cao, C.C., Zhang, C.J., Li, Y.: CrowdCleaner: Data cleaning for multi-version data on the Web via crowdsourcing. In: IEEE International Conference on Data Engineering, pp. 1182–1185 (2014)
29. Verroios, V., Garcia-Molina, H.: Entity resolution with crowd errors. In: IEEE International Conference on Data Engineering, pp. 219–230 (2015)
30. Wang, J., Li, G., Kraska, T., Franklin, M.J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: ACM SIGMOD International Conference on Management of Data, pp. 229–240 (2013)
31. Widom, J., Agrawal, A.P., Benjelloun, O., Ch, A., Chaumond, J., Murthy, R., Mutsuzaki, M., Sugihara, T., Theobald, M.: Chapter 5 Trio: A system for data, uncertainty, and lineage (2013)
32. Xu, K., Zou, L., Yu, J.X., Chen, L., Xiao, Y., Zhao, D.: Answering label-constraint reachability in large graphs. In: ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October, pp. 1595–1600 (2011)
33. Zhang, C.J., Chen, L., Jagadish, H.V., Cao, C.C.: Reducing uncertainty of schema matching via crowdsourcing. Proceedings of the VLDB Endowment 6(9), 757–768 (2013)
34. Zhang, C.J., Chen, L., Tong, Y., Liu, Z.: Cleaning uncertain data with a noisy crowd. In: IEEE International Conference on Data Engineering, pp. 6–17 (2015)

Acknowledgments

This research is funded by NSFC (No. 61773167) and the Natural Science Foundation of Shanghai (No. 17ZR1444900).

Author information

Authors and Affiliations

East China Normal University, Shanghai, China
Yongcheng Wu, Xin Lin, Yan Yang & Liang He

Corresponding author

Correspondence to Yan Yang.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

Appendix

1.1 New Reachability

The reachability when the crowd returns ‘yes’ is

$$ R_{G,q}^{y}=R_{G,q}+P_{q,e}^{*}({p_{e}^{y}}-p_{e}) $$

(24)

and the reachability when the crowd returns ‘no’ is

$$ R_{G,q}^{n}=R_{G,q}+P_{q,e}^{*}({p_{e}^{n}}-p_{e}) $$

(25)

where ${p_{e}^{y}}$ is the new probability of edge e if the crowd’s answer is ‘yes’, and ${p_{e}^{n}}$ is defined analogously for the answer ‘no’.

Proof

First, we divide the possible graphs of G into two parts: graphs containing e ($G_{e}$) and graphs without e ($G_{\overline {e}}$). For every $pg \in PG_{G_{e}}$, there is a corresponding $pg^{\prime } \in PG_{G_{\overline {e}}}$ such that the edge sets $E_{pg} \subseteq E$ and $E_{pg^{\prime }} \subseteq E$ are the same except that $E_{pg^{\prime }}$ does not contain e. Then, we have

$$ Pr(pg^{\prime})=\frac{(1-p_{e})Pr(pg)}{p_{e}} $$

(26)

By (2), $R_{G,q}^{0}$ can also be represented as:

$$ R_{G,q}^{0}=\sum\limits_{pg \in PG_{G_{e}}} \left( Pr(pg)r_{q}^{pg} + Pr(pg^{\prime})r_{q}^{pg^{\prime}} \right) $$

(27)

Furthermore, $G_{e}$ can be divided into $G(P_{e},P_{\overline e})$ and $G_{e}-G(P_{e},P_{\overline e})$. Obviously, for $pg \in PG_{G(P_{e}, P_{\overline e})}$, $r_{q}^{pg}= 1$ and $r_{q}^{pg^{\prime }}= 0$. For $pg \in PG_{G_{e}-G(P_{e}, P_{\overline e})}$, since whether e exists or not does not influence $r_{q}$, we have $r_{q}^{pg}=r_{q}^{pg^{\prime }}$.

Then, for $pg \in PG_{G_{e}}$, the new graph probabilities are $Pr_{e \rightarrow y}(pg) = \frac {{p_{e}^{y}} Pr(pg)}{p_{e}}$ and $Pr_{e \rightarrow n}(pg) = \frac {{p_{e}^{n}} Pr(pg)}{p_{e}}$, where $e \rightarrow y$ ($e \rightarrow n$) denotes that edge e is cleaned to existence (nonexistence). For the corresponding $pg^{\prime } \in PG_{G_{\overline e}}$, $Pr_{e \rightarrow y}(pg^{\prime }) = \frac {(1-{p_{e}^{y}}) Pr(pg)}{p_{e}}$ and $Pr_{e \rightarrow n}(pg^{\prime }) = \frac {(1-{p_{e}^{n}}) Pr(pg)}{p_{e}}$. Hence, for $pg \in PG_{G_{e}}$,

$$\begin{array}{@{}rcl@{}} Pr(pg) + Pr(pg^{\prime}) &=& Pr_{e \rightarrow y}(pg) + Pr_{e \rightarrow y}(pg^{\prime})\\ &=& Pr_{e \rightarrow n}(pg) + Pr_{e \rightarrow n}(pg^{\prime}) \end{array} $$

From the above analysis, for $pg \in PG_{G_{e}-G(P_{e}, P_{\overline e})}$ we have

$$\begin{array}{@{}rcl@{}} && Pr(pg)r_{q}^{pg} + Pr(pg^{\prime})r_{q}^{pg^{\prime}}\\ &=& Pr_{e \rightarrow y}(pg)r_{q}^{pg} + Pr_{e \rightarrow y}(pg^{\prime})r_{q}^{pg^{\prime}}\\ &=& Pr_{e \rightarrow n}(pg)r_{q}^{pg} + Pr_{e \rightarrow n}(pg^{\prime})r_{q}^{pg^{\prime}} \end{array} $$

(28)

For $pg \in PG_{G(P_{e}, P_{\overline e})}$, we have

$$ Pr(pg)r_{q}^{pg} + Pr(pg^{\prime})r_{q}^{pg^{\prime}}=Pr(pg) $$

(29)

$$ Pr_{e \rightarrow y}(pg)r_{q}^{pg} + Pr_{e \rightarrow y}(pg^{\prime})r_{q}^{pg^{\prime}} = \frac{{p_{e}^{y}} Pr(pg)}{p_{e}} $$

(30)

$$ Pr_{e \rightarrow n}(pg)r_{q}^{pg} + Pr_{e \rightarrow n}(pg^{\prime})r_{q}^{pg^{\prime}} = \frac{{p_{e}^{n}} Pr(pg)}{p_{e}} $$

(31)

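Then, $R_{G,q}^{0}$ is the sum over all $pg \in PG_{G_{e}}$ of the corresponding terms: (29) for $pg \in PG_{G(P_{e}, P_{\overline e})}$ and the first line of (28) otherwise. Likewise, $R_{G,q}^{y}$ is the sum of the terms (30) and the second line of (28), and $R_{G,q}^{n}$ is the sum of the terms (31) and the third line of (28). Since the three lines of (28) are equal, they cancel in the differences. Therefore,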

$$\begin{array}{@{}rcl@{}} R_{G,q}^{y}-R_{G,q}^{0} &= &\sum\limits_{pg \in PG_{G(P_{e},P_{\overline e})}} \frac{({p_{e}^{y}} -p_{e})Pr(pg)}{p_{e}}\\ &=& ({p_{e}^{y}}-p_{e})P_{q,e}^{*}\\ R_{G,q}^{n}-R_{G,q}^{0} &= &\sum\limits_{pg \in PG_{G(P_{e},P_{\overline e})}} \frac{({p_{e}^{n}} -p_{e})Pr(pg)}{p_{e}}\\ &= &({p_{e}^{n}}-p_{e})P_{q,e}^{*} \end{array} $$

□
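As a numerical sanity check of (24) and (25), the following Python sketch enumerates every possible graph of a small directed example and compares the recomputed reachability with the closed form. It is only an illustration under stated assumptions: the toy graph, its probabilities, the updated probability `p_new` (standing in for $p_{e}^{y}$ or $p_{e}^{n}$), and the helper names `reaches`, `reachability` and `pivotal_probability` are all hypothetical, and $P_{q,e}^{*}$ is taken to be the probability, over the other edges, that the target is reachable with e but not without it, as the last step of the proof suggests.

```python
from itertools import product

def reaches(present, s, t):
    """Depth-first search from s to t over the directed edges in `present`."""
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for a, b in present:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return False

def reachability(probs, s, t):
    """Exact reachability: total probability of possible graphs where t is reachable from s."""
    edges = list(probs)
    total = 0.0
    for mask in product([0, 1], repeat=len(edges)):
        pr = 1.0
        for edge, bit in zip(edges, mask):
            pr *= probs[edge] if bit else 1.0 - probs[edge]
        present = [edge for edge, bit in zip(edges, mask) if bit]
        if reaches(present, s, t):
            total += pr
    return total

def pivotal_probability(probs, e, s, t):
    """Assumed reading of P*: probability over the other edges that e decides reachability."""
    others = [x for x in probs if x != e]
    total = 0.0
    for mask in product([0, 1], repeat=len(others)):
        pr = 1.0
        for edge, bit in zip(others, mask):
            pr *= probs[edge] if bit else 1.0 - probs[edge]
        present = [edge for edge, bit in zip(others, mask) if bit]
        if reaches(present + [e], s, t) and not reaches(present, s, t):
            total += pr
    return total

# Hypothetical uncertain graph: edge -> existential probability.
probs = {("s", "a"): 0.6, ("a", "t"): 0.7, ("s", "b"): 0.5,
         ("b", "t"): 0.4, ("a", "b"): 0.3}
e = ("a", "t")
p_new = 0.9  # hypothetical updated probability of e after a crowd answer

r_old = reachability(probs, "s", "t")
p_star = pivotal_probability(probs, e, "s", "t")
r_new = reachability({**probs, e: p_new}, "s", "t")
closed_form = r_old + p_star * (p_new - probs[e])
print(abs(r_new - closed_form) < 1e-12)
```

The check relies only on the fact that reachability is linear in the probability of a single edge, so the same comparison holds for any value of `p_new` in [0, 1].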

1.2 NECP is #P-hard

Proof

To prove that NECP is #P-hard, we reduce ECP [17] to NECP, since Lin et al. [17] have proven that ECP (17) is #P-hard. ECP corresponds to the situation where the crowd is exactly accurate.

In detail, for B to-clean edges, there are $2^{B}$ possible cleaning results in total: $CS=\{000...000,000...001,000...011,......,111...111\}$, each element of which is a B-bit sequence in which each bit represents the cleaning result of the corresponding edge. For simplicity, we assume that all edges of the uncertain graph G have the same existential probability p.

For ECP, the expected query result quality after cleaning is

$$ Q=(1-p)^{B} Q_{1} + (1-p)^{B-1} p Q_{2} + (1-p)^{B-2} p^{2} Q_{3} + ... + p^{B} Q_{2^{B}} $$

(32)

where $Q_{1}$, $Q_{2}$, $Q_{3}$, ..., $Q_{2^{B}}$ are the query result qualities corresponding to the elements of CS.

For NECP, the expected query result quality after cleaning is

$$\begin{array}{@{}rcl@{}} Q^{N} & =&(1-Pr(C_{r}))^{B} Q_{1} + (1-Pr(C_{r}))^{B-1} Pr(C_{r}) Q_{2}\\ &&+ (1-Pr(C_{r}))^{B-2} (Pr(C_{r}))^{2} Q_{3} + ... + (Pr(C_{r}))^{B} Q_{2^{B}} \end{array} $$

(33)

Simplifying (5), we have

$$\begin{array}{@{}rcl@{}} Pr(C_{r})&=&(2P_{c} - 1)p + 1-P_{c}\\ 1-Pr(C_{r})&=&-(2P_{c} - 1)p + P_{c} \end{array} $$

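We can see that $Pr(C_{r})$ is a linear function of p. In particular, when the crowd is exactly accurate ($P_{c}=1$), $Pr(C_{r})=p$ and (33) coincides with (32). Therefore, being able to compute $Q^{N}$ implies being able to compute Q, which shows that solving NECP is #P-hard. □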
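The reduction step can be illustrated numerically. The Python sketch below assumes that (5), which is not reproduced in this appendix, has the form $Pr(C_{r}) = P_{c}\,p + (1-P_{c})(1-p)$, where $P_{c}$ is the probability that the crowd answers correctly; this form is consistent with the simplified expressions above. The function names and the toy quality function are hypothetical and stand in for the real $Q_{i}$ values, which would require a query evaluation.

```python
from itertools import product

def pr_yes(p, p_c):
    """Assumed form of (5): probability that the crowd answers 'yes' for an edge with prior p."""
    return p_c * p + (1.0 - p_c) * (1.0 - p)

def expected_quality(budget, yes_prob, quality):
    """Weighted sum over all 2^B answer sequences, in the shape of (32) and (33)."""
    total = 0.0
    for bits in product([0, 1], repeat=budget):
        k = sum(bits)  # number of 'yes' results in this sequence
        weight = (yes_prob ** k) * ((1.0 - yes_prob) ** (budget - k))
        total += weight * quality(bits)
    return total

def toy_quality(bits):
    """Hypothetical stand-in for Q_i: counts the 'yes' bits of a sequence."""
    return float(sum(bits))

p, p_c, budget = 0.3, 0.8, 3

# Pr(C_r) agrees with the simplified linear expression in p.
linear_form = (2 * p_c - 1) * p + 1 - p_c
print(abs(pr_yes(p, p_c) - linear_form) < 1e-12)

# With a perfectly accurate crowd, Pr(C_r) = p and (33) reduces to (32).
q_ecp = expected_quality(budget, p, toy_quality)
q_necp_exact_crowd = expected_quality(budget, pr_yes(p, 1.0), toy_quality)
print(abs(q_ecp - q_necp_exact_crowd) < 1e-12)
```

The enumeration visits all $2^{B}$ sequences, so the sketch is only usable for very small budgets; it is meant to make the structure of (32) and (33) concrete, not to suggest a practical algorithm.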

1.3 Calculating $P_{q,e}^{*}$

Proposition 2: Calculating $P_{q,e}^{*}$ can be reduced to calculating reachability.

Proof

Similar to calculating the reachability $R_{q}$ in (3), which in theory requires enumerating all $sg \in SG_{G,q}$, calculating $P_{q,e}^{*}$ requires enumerating all $sg \in SG_{G(P_{e},P_{\overline e}),q}$. Moreover, identifying $SG_{G(P_{e},P_{\overline e})}$ is impractical and even more troublesome than enumerating $SG_{G,q}$.

In Example 2, we mentioned that the Monte Carlo method can approximate the result of $Pr(p_{1} \vee p_{2} \vee p_{3} \vee ... \vee p_{n})$ (assume $n=|AP|$). Similarly, the Monte Carlo method is also applicable to calculating $P_{q,e}^{*}$. First, we denote the set of all paths passing through (not passing through) edge e by $AP_{e}$ ($AP_{\overline e}$). Then, we have

$$ \sum\limits_{sg \in SG_{G(P_{e},P_{\overline e})}} Pr(sg)=Pr\Big(\bigvee_{p \in AP_{e}} p\Big)\Big(1-Pr\Big(\bigvee_{p \in AP_{\overline e}} p\Big)\Big) $$

(34)

By (34), we only need to compute two parts separately: $Pr(\bigvee _{p \in AP_{e}} p)$ and $Pr(\bigvee _{p \in AP_{\overline e}} p)$, each of which is equivalent to calculating reachability. □
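To make the sampling idea concrete, here is a minimal Python sketch of a naive Monte Carlo estimate of $P_{q,e}^{*}$. It is not the estimator described above: instead of evaluating the two path-union probabilities in (34), it samples the edges other than e and counts how often the target is reachable with e but not without it, which is an assumed reading of $P_{q,e}^{*}$. The graph, the function names, the sample count and the seed are hypothetical.

```python
import random

def reaches(present, s, t):
    """Depth-first search from s to t over the directed edges in `present`."""
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for a, b in present:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return False

def estimate_p_star(probs, e, s, t, samples=20000, seed=0):
    """Naive sampling estimate of the probability that edge e decides s-t reachability."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    others = [x for x in probs if x != e]
    hits = 0
    for _ in range(samples):
        # Sample one possible graph over the edges other than e.
        present = [x for x in others if rng.random() < probs[x]]
        if reaches(present + [e], s, t) and not reaches(present, s, t):
            hits += 1
    return hits / samples

# Hypothetical graph: a direct edge s->t plus a two-hop detour s->a->t.
probs = {("s", "t"): 0.5, ("s", "a"): 0.5, ("a", "t"): 0.5}
estimate = estimate_p_star(probs, ("s", "t"), "s", "t")
print(round(estimate, 2))
```

For this toy graph the exact value is $1 - 0.5 \cdot 0.5 = 0.75$, because the direct edge decides reachability exactly when the detour is broken, so the printed estimate should be close to that value.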

About this article

Cite this article

Wu, Y., Lin, X., Yang, Y. et al. Cleaning uncertain graphs via noisy crowdsourcing. World Wide Web 22, 1523–1553 (2019). https://doi.org/10.1007/s11280-018-0624-8

Received: 30 November 2017
Revised: 01 June 2018
Accepted: 18 July 2018
Published: 31 July 2018
Issue Date: 15 July 2019
DOI: https://doi.org/10.1007/s11280-018-0624-8
