UvA-DARE (Digital Academic Repository) Privacy impact assessment in large-scale digital forensic investigations

The large increase in the collection of location, communication and health data, among others, from seized digital devices such as mobile phones, tablets, IoT devices and laptops often poses serious privacy risks. To measure privacy risks, privacy impact assessments (PIA) are useful tools, and Directive (EU) 2016/680 (Police Directive) requires their use. While much has been said about PIA methods pursuant to Regulation (EU) 2016/679 (GDPR), less has been said about PIA methods pursuant to the Police Directive. Moreover, little research has been done to explore and measure privacy risks that are specific to law enforcement activities which necessitate the processing of large amounts of data. This study tries to fill this gap by conducting a PIA on a big data forensic platform as a case study. It also answers the question of how a PIA should be carried out for large-scale digital forensic operations and describes the privacy risks and threats we learned from conducting it. Finally, it articulates concrete privacy measures to demonstrate compliance with the Police Directive. © 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC license


Introduction
The processing of personal data in large-scale digital evidence in criminal investigations falls within the scope of Directive (EU) 2016/680 of the European Parliament and of the Council (the so-called Police Directive). This legislative instrument entered into force on 5 May 2016 and repeals Council Framework Decision 2008/977/JHA. The directive not only protects individuals' personal data which are processed for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, but also ensures a high level of public security through the free flow of such data between competent authorities. Member States were expected to transpose the directive into national law between May 2016 and May 2018 (EC & EP, 2016a).
One of the most important provisions of the directive is set out in Article 4(1) (EUR-Lex, 2017) which states that personal data of natural persons must be: a) processed lawfully and fairly; b) collected for specified, explicit and legitimate purposes and processed only in line with these purposes; c) adequate, relevant and not excessive in relation to the purpose in which they are processed; d) accurate and updated where necessary; e) kept in a form which allows identification of the individual for no longer than is necessary for the purpose of the processing; f) appropriately secured, including protection against unauthorised or unlawful processing. The said Article 4 stipulates that Member States shall ensure that the processing is in accordance with the principles of necessity and proportionality (EC & EP, 2016a).
The Police Directive contains various novel provisions to address the limited scope and the outdatedness of the Framework Decision (de Hert and Papakonstantinou, 2016). At first glance, data protection by design and by default are introduced in Article 20 as two of the obligations of the controller. Thus, the competent authorities must take into account these two principles both at the time of the determination of the means for processing and at the time of the processing itself. Another novelty is the notification of a personal data breach to the supervisory authority as stipulated in Article 30. Moreover, designation of a data protection officer is introduced in Article 32 as a new obligation for the controller. Last but not least, the directive provides new rights to data subjects which include right to receive information by the data subject (Article 13), right of access by the data subject (Article 14), and right to rectification or erasure of personal data (Article 16) (Leiser and Custers, 2019).
Although there has been an intense debate about the role, impact and practical implementation of the Police Directive, it is clear that the directive is a positive step towards comprehensive data protection in the EU (EDPS, 2015). Another point of debate is that the right to data protection and public security seem to be competing interests (Europol, 2018). The implementation of the directive comes against the background of a debate on how the right to data protection and the right to security can be balanced within a society where the police may at times seemingly give priority to the obligation to keep society safe over privacy and data protection. As underlined by the case law of the Court of Justice of the European Union (Volker und Markus Schecke GbR (C-92/09), Hartmut Eifert (C-93/09) v Land Hessen, 2010), the right to the protection of personal data is not an absolute right; that is, the enjoyment of this right may be limited to ensure that other rights are protected, such as when protecting society from crime and terrorism.
What is missing from the Police Directive are guidelines on how to successfully implement appropriate safeguards for compliance (Marquenie, 2017). One of the provisions, Article 27, requires the data controller to carry out a Data Protection Impact Assessment (DPIA). DPIAs (previously known as privacy impact assessments (PIAs)) are tools to evaluate the origin, nature, particularity and severity of risks to the rights and freedoms of natural persons and to determine the appropriate measures (EC & EP, 2016b).
Article 27(1) sets out how Member States determine whether a PIA has to be carried out: "Where a type of processing, in particular, using new technologies, and taking into account the nature, scope, context and purposes of the processing is likely to result in a high risk to the rights and freedoms of natural persons, Member States shall provide for the controller to carry out, prior to the processing, an assessment of the impact of the envisaged processing operations on the protection of personal data." Furthermore, Article 27(2) provides a minimum standard for conducting a PIA: "The assessment referred to in paragraph 1 shall contain at least a general description of the envisaged processing operations, an assessment of the risks to the rights and freedoms of data subjects, the measures envisaged to address those risks, safeguards, security measures and mechanisms to ensure the protection of personal data and to demonstrate compliance with this Directive, taking into account the rights and legitimate interests of the data subjects and other persons concerned" (EC & EP, 2016a). A DPIA can to some extent be seen as a GDPR checklist primarily focused on 'data protection', while a PIA covers both the right to private life and the right to data protection. Because of this broader scope, we use the term PIA instead of the term DPIA used in the Police Directive.
Fortunately, PIAs have been studied in detail since the mid-1990s (Wadhwa and Rodrigues, 2013). Plenty of PIA methods have been proposed by researchers, governments, Data Protection Authorities (DPAs) and standards bodies. In addition, more industry/technology-oriented PIA methods have been developed, such as the RFID PIA (Spiekermann, 2012) and the Smart Grid DPIA template (Smart Grid Task Force 2012-14 Expert Group 2, 2014). However, this is not the case for the police sector. Using a PIA methodology that does not consider the unique nature of police activities can lead to insufficient identification of risks and difficulty in demonstrating compliance with the Police Directive. This paper addresses these issues by evaluating existing PIA methods and providing a comprehensive methodology for digital forensics based on hands-on experience, with particular attention to large-scale processing. This work is an important step towards a better understanding of privacy risks specific to law enforcement processing practices by establishing a baseline for the assessment and treatment of these risks. Lastly, it presents a guide to the implementation of privacy-by-design (PbD) principles in large-scale digital forensic investigations.
The remainder of the paper is structured as follows. Section 2 presents Hansken and discusses state of the art. Section 3 describes the methodology. Section 4 gives an overview of the results of the case study. Finally, Section 5 presents our conclusions and our plans for future work.

State of the art
Since we propose to conduct a case study on Hansken, we first give an overview of Hansken in this section.

Introduction to digital forensics as a service (DFaaS)
In this section we present Hansken and its predecessor, the so-called Xiraf (an XML Information Retrieval Approach to digital Forensics).
Xiraf was developed by the Netherlands Forensic Institute (NFI) as an XML-based approach to manage and query forensic traces from the high volume of seized digital material. Xiraf executes many forensic analysis tools in a systematic way to extract traces as XML-based outputs. In this way, the outputs of the analysis tools are integrated so that they can be indexed and queried in a centralized XML database. Users are able to search and browse the outputs through a web interface (Alink et al., 2006). Its next version was described as a second-generation forensic analysis system with new functions such as parallel execution, reduced I/O, distributed processing and more (Bhoedjang et al., 2012). Its latest version is a service-based approach named DFaaS. Unlike in the traditional digital investigation process, the data, the software and the storage and processing capacity are centralized. This version therefore provides a faster forensic analysis process, earlier trace availability, reduced overhead time and a central system that can be used by multiple departments concurrently (van Baar et al., 2014).
Hansken is the successor of Xiraf, with a capacity of processing three terabytes of data per hour. The three main reasons for developing Hansken are to minimize case lead time, to maximize trace coverage and to allow specialization of the people involved. Considering the sensitivity of the data processed in such a big data platform, the developers specified eight design principles: (1) Security, (2) Privacy, (3) Transparency, (4) Multi-tenancy, (5) Future proof, (6) Data retention, (7) Reliability, (8) High availability. As a big data solution, the Hadoop Distributed File System (HDFS) and MapReduce were used. Hansken is the first large-scale digital forensic system that is implemented with PbD in mind (van Beek et al., 2015).

PIAs in law enforcement and justice sectors
An attempt to draft a comprehensive PIA methodology for law enforcement agencies (LEAs) was made in a European Commission project, Visual Analytics for Sense-making in Criminal Intelligence Analysis (VALCRI). In their white paper, Schlehahn et al. (2014) present a comparative analysis of the DPIA methodologies of five European countries (Belgium, France, Germany, Spain and the United Kingdom) and the Article 29 Working Party Guidelines. Their results show that none of the compared methodologies refers to the application area of the Police Directive. They also claim that the risk of an interference with fundamental rights always exists in law enforcement processing, even if it is legally justified. Whichever methodology is chosen, PIAs should be used to minimize this interference as far as necessary, and appropriate safeguards should be built into processes and systems.
In 2014 2017). These initiatives are not at a level of technological development itself but at a 'higher' level, at the legislation level. While one might argue that a PIA at legislation level means that the legal basis of the technological development is privacy friendly/sound, in reality, this level may not be enough and a PIA at technology level may identify different or more risks that are not immediately evident at legislation level.

Methodology
This paper describes how to conduct a PIA in large-scale digital forensic investigations. To this end, we conduct a case study where the big data forensic platform (the so-called DFaaS) developed by the NFI is used as an example to carry out the PIA. This big data forensic solution is called Hansken as mentioned in Section 2.1. To process and investigate multiple terabytes of seized digital material, the NFI has been using DFaaS since December 2010 (van Beek et al., 2015). The reasons for the selection of Hansken are as follows: it processes large volume of forensic data, it has built-in privacy measures for the processed data and its users (forensic investigators), transparency is one of the design principles and ranked third in terms of priority (van Beek et al., 2015). This paper searches for examples of best practice to construct an optimal PIA methodology that best suits Police Directive requirements. For this purpose, we systematically analysed current PIA methods. The review ended in ten fundamental PIA methods:  (Wright, 2013), Systematic Methodology for PIAs (Oetzel and Spiekermann, 2014), UK ICO (United Kingdom's Information Commissioner's Office) COP (Conducting PIAs Code of Practice) (UK ICO, 2014). These methodologies have some similarities like containing a threshold analysis to test the necessity of PIA and threat examples; also some differences like risk identification and evaluation approaches (Vemou and Karyda, 2018).
Each method has its strengths and limitations. After a comparative analysis, we combined the best elements (strong points) of three methodologies: the CNIL PIA method, the Dutch PIA Model and ISO 29134. We selected PIA guidelines that have been recently proposed or updated, especially after the GDPR. Since the Police Directive and the GDPR propose similar solutions in many areas (de Hert and Sajfert, 2018), we included the methods that are in line with the GDPR. Vemou and Karyda (2018) argued that relying on specific legal frameworks may limit PIAs to a compliance check instead of a comprehensive review of the privacy issues of a process. Hence, the inclusion of an international standard on PIA, namely ISO/IEC 29134, helps to mitigate such limitations.
In detail, our evaluation of the ten PIA methods is based on the following benchmarks:
1 Is the PIA method up-to-date?
2 Is the GDPR used as a legal basis?
3 Does the PIA method provide an automated tool?
4 Is guidance given on how to conduct a PIA in a big data context?
5 Does the PIA method contain a checklist/a set of questions addressing privacy risks/threats of large-scale data processing?
Almost every DPA (or ICO) provides guidelines on the implications of big data for data protection. However, guidance for big data-specific PIA is offered only in UK PIA COP (UK ICO, 2017) and Dutch PIA Model (JenV, 2018) as shown in Table 1. Both of the supplements are based on the existing PIA methodologies, key points and experience gained with Big Data. What is more, Dutch PIA Model is predicated on not only the GDPR but also the Police Directive. It proposes similar procedures for the similar articles (or recitals) with further assistance for the exceptional circumstances in the Police Directive. Similarly, the main criterion for choosing CNIL PIA Method is that it is supported by an automated tool. Such a tool may facilitate comparison, improve standardisation, support enterprise accountability and ease the PIA process to implement compliance, especially in multi-jurisdictional legal environments (Tancock, 2015). Nevertheless, it is argued that none of the known PIA tools are capable of addressing emerging changes in privacy laws and bridging the knowledge gap between lawyers and engineers both of whose contributions are essential for a successful PIA (US Patent App. No. 15/459,909, 2017).
CNIL PIA consists of a methodology, a template and knowledge bases. Its automated tool visually assists PIA practitioners and makes the method practical to apply. Specifically, CNIL's tool is designed to aid organizations in building compliance (CNIL, 2017), to facilitate commenting on and validating PIA-related issues and to increase stakeholder involvement. It is noteworthy that the tool itself does not automate the PIA process; instead it guides the PIA steps and automatically creates the PIA report, risk overview and risk mapping. Also, it is important not to focus solely on the threats stated in the guideline. ISO 29134 adopted a checklist approach. Its references to other ISO standards (e.g., ISO/IEC 29151, ISO/IEC 27002) in several articles might cause PIA practitioners to lose time, as they can be hard to follow. The Dutch PIA Model has seventeen points, explained in detail, for performing a PIA, plus a big data-specific supplement. A drawback is that there is no English translation at the moment, since the model was produced for proposed regulations and data processing by the Dutch government.

Results and discussion
This study seeks to understand de facto privacy risks in a centralized forensic platform processing large amounts of personal data. So, the privacy risks which are solely specific to Hansken and the processing of the NFI are irrelevant and beyond the scope of this paper. Instead we generalize our findings to other similar platforms. Likewise, we do not evaluate the privacy risks since they are dependent on the priorities and risk criteria of LEAs.
It should be noted that the processing of the NFI in reality falls under the GDPR. This is because the Ministry of Justice and Security has no independent powers for the prevention, investigation, detection and prosecution of indictable offences or the enforcement of punishments, and no link can be made with the "competent authority" as stated in the Directive. However, forensic investigations in the majority of the Member States (e.g., France, Belgium, Spain) are performed by a police unit, and the aforementioned processing operations fall under the Police Directive. Therefore, we limit our case study to the Police Directive.

General description of the envisaged processing operations
Hansken processes special categories of personal data and personal data relating to criminal convictions and offences on a large scale. This processing is likely to result in a high risk to the rights and freedoms of natural persons. Also, an amendment of the Police Data Act (Dutch: Wet politiegegevens (Wpg)) and the Judicial and Criminal Records Act (Dutch: Wet justitiële gegevens (Wjsg)) to implement the Police Directive entered into force on January 1, 2019, after DFaaS had become a standard for criminal cases (December 2010) (Decree implementing the Directive data protection investigation and prosecution, 2019). The processing within Hansken is expected to be in conformity with these new provisions. For these reasons, a PIA should be carried out.
The NFI developed Hansken as a digital search engine for processing and investigating high volumes of (seized) digital material. Its goal is to give the right people access to the right information at the right time. With Hansken, an investigator can quickly and efficiently search for traces in large quantities of seized data carriers such as computers and mobile phones. Anything that may be relevant can be searched for, for example words and names or properties of traces such as chat-messages, emails or photos, whether or not taken with a certain camera. Its processing provides a significant contribution towards finding the truth in criminal cases.
Hansken is used for case investigations on request and under the direction of the police and the Public Prosecution Service (PPS) (Dutch: Openbaar Ministerie (OM)). For this task, the NFI, the police and the PPS are considered as joint controllers. The data for this task are supplied by the police and/or other investigative authorities. Other usages are development of libraries and software and giving courses on how to use Hansken. For these tasks, the NFI is considered as the controller. These core tasks fall under Regulation of the Minister of Security and Justice, dated 8 May 2012, no. 227774, containing provisions regarding the assignment of the NFI (Regulation of NFI duties) (Regeling taken NFI, 2012).
Possible data subjects are convict(s), suspect(s), victim(s), witness(es) and third parties who have nothing to do with the investigation, or persons who are wrongly suspected. With Hansken, all categories of personal data can be processed. For example, information from a mobile telephone contains, among other things, photographs, names, phone numbers, e-mails, meta-data, location data, payment data, video and data concerning religious conceptions or health. The processing can therefore concern common personal data and sensitive personal data (Recital 37 of the Police Directive (EC & EP, 2016a)). The persons who access the case data are officials from the Hansken team.
Within the Hansken framework, there is a wide variety of supporting assets, ranging from forensic analysis tools to big data solutions. The forensic tools collection consists of existing forensic tools, both publicly available tools such as UFED, EnCase, FTK, EXIF etc. and tools that are developed in-house. HDFS, MapReduce, Cassandra, HBase, Elasticsearch and Kafka are some of the modules/components used in the DFaaS architecture. Discussing all these assets in detail goes beyond the scope of this paper.
Strategies for privacy and security are based on the commandments from the Jericho Forum. The Jericho Forum, consisting of IT customers and vendor organisations, proposes a new security model called de-perimeterisation instead of central protection (Lacey, 2005). The goal of de-perimeterisation is to make information flows boundaryless by using encryption, inherently secure computer protocols, inherently secure computer systems and data-level authentication (van Beek et al., 2015). Apart from that, role-based access control (RBAC) is used for identity and user management. With RBAC, access rights are linked to roles within the organization or business process. System users obtain access rights by fulfilling a certain role.
Once the data are read from the seized material, they are encrypted. Stored data are encrypted too. The encryption keys are stored in a different domain and separated from the encrypted image. All requests to the central service, such as authentication, authorization, data uploads, forensic queries and content retrievals, are logged. Any privacy-sensitive information in log messages is removed by replacing identifying (tagged) information with anonymized (irreversible) or deidentified (reversible) values. To reverse deidentified values, access to the cryptographic keys is required. Hansken uses HDFS, which ensures data availability with three replicas per file by default. Additionally, Hansken works on a copy of the seized material, so the original data are still available for recovery.
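The RBAC principle described above can be illustrated with a minimal sketch. The role names and permissions below are hypothetical and chosen only for illustration; the source does not list the roles actually defined in Hansken.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping (an assumption, not Hansken's real roles).
ROLE_PERMISSIONS = {
    "investigator": {"search_traces", "view_trace"},
    "assessor": {"view_trace", "review_privileged"},
    "administrator": {"manage_users"},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


def is_allowed(user: User, permission: str) -> bool:
    """Return True if at least one of the user's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles
    )


alice = User("alice", {"investigator"})
print(is_allowed(alice, "search_traces"))      # True
print(is_allowed(alice, "review_privileged"))  # False
```

Because rights are attached to roles rather than to individuals, withdrawing a role from a user removes all rights that came with it in a single step.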
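The difference between anonymized (irreversible) and deidentified (reversible) log values can be sketched as follows. This is an illustrative assumption, not the mechanism used in Hansken: the class `LogScrubber` and its methods are hypothetical, and a keyed token vault stands in for the key-based reversal mentioned above, since the source does not describe the actual scheme.

```python
import hmac
import secrets


class LogScrubber:
    """Replaces tagged identifiers before a message is written to the log."""

    def __init__(self, key: bytes):
        self._key = key    # in practice held in a separate key domain
        self._vault = {}   # token -> original value, usable only with the key

    def anonymize(self, value: str) -> str:
        # Irreversible: a random token with no stored link to the value.
        return "anon-" + secrets.token_hex(8)

    def deidentify(self, value: str) -> str:
        # Reversible: a keyed token, with the original kept in the vault.
        mac = hmac.new(self._key, value.encode("utf-8"), "sha256")
        token = "pseud-" + mac.hexdigest()[:16]
        self._vault[token] = value
        return token

    def reidentify(self, token: str, key: bytes) -> str:
        # Reversal is refused unless the caller presents the correct key.
        if not hmac.compare_digest(key, self._key):
            raise PermissionError("key required to reverse a deidentified value")
        return self._vault[token]


key = secrets.token_bytes(32)
scrubber = LogScrubber(key)
token = scrubber.deidentify("investigator42")
print(f"user={token} query={scrubber.anonymize('John Doe')}")
print(scrubber.reidentify(token, key))  # investigator42
```

The deidentified token is deterministic for a given key, so log entries by the same user can still be correlated without revealing who the user is.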
Indexing and analysing high volumes of seized digital material are the main uses of Hansken. These uses are necessary; otherwise searching for traces would be time consuming and detection capacity would be scarce. No automated decisions that produce legal effects for the persons concerned, or that significantly affect them, are taken on the basis of the data processed by Hansken. Final decisions regarding data subjects are always taken with human intervention. Within Hansken there is a tool to exclude confidentiality communication. That tool works as follows: the defence lawyer provides a list of keywords, files, folders etc. that are highly likely to contain confidentiality communication. The traces having one or more hits according to the list are given the status 'marked' (suspected). Furthermore, when someone investigating data in Hansken encounters suspected confidentiality communication, he/she gives the trace the status 'marked' too. An assessor, an employee who is not involved in the investigation and has specific authorizations, then decides whether confidentiality communication is indeed involved. If that is the case, the trace is given the status 'confirmed'. If there is no confidentiality communication, the trace receives the status 'rejected'. Hansken users such as investigators can request only the traces with the status 'unmarked' or 'rejected' (ECLI: Netherlands: RBAMS: 2018:2504). The functionality of this tool complies with the instruction manual adopted by the National Assembly Investigation Officers (Dutch: Landelijke Vergadering Rechercheofficieren) (National Assembly Investigation Officers, 2014).
Hansken may be used for profiling suspects. For instance, the location and behavioural characteristics of a suspect can be determined on the basis of the digital evidence. Algorithms and techniques used in Hansken have been scientifically tested, as shown in publications or peer reviews. Some of them, such as the firearm detection and geodata extraction algorithms, are publicly available. The others are not released publicly because this could potentially harm the detection process. However, insight might be given to the defence as to how the data have been processed in a certain case, as allowed by the examining magistrate. Hansken uses big data: by using large amounts of structured and unstructured data from different sources, data are analysed to look for traces and correlations that can provide knowledge for investigations.
The Regulation of NFI duties, the Criminal Experts Act (Dutch: Besluit register deskundige in strafzaken (DIS)), the Code of Criminal Procedure (WvSv) and conventions such as the Prüm Treaty and the International Legal Assistance Convention (Dutch: Internationale Rechtshulpverdragen) provide the necessary legal grounds for the processing within Hansken. In addition, in the Ennetcom case (ECLI: Netherlands: RBAMS: 2018:2504), the court ruled that the results obtained from Hansken are not unreliable, that the procedures have been sufficiently controllable by the defence and that the use of such aids does not require any additional legal provisions. The obligation for the NFI to process personal data may, if appropriate, arise from the Criminal Experts Act. On the basis of this law, experts of the NFI can receive instructions for carrying out an expert investigation from the examining magistrate or the public prosecutor. The personal data are not processed for any purpose other than that for which they have been collected, namely the detection of criminal offences. Personal data from a specific case are not combined with personal data from another case, unless the PPS has given specific permission to the NFI.
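The status workflow of this filtering tool can be summarised as a small state machine. The sketch below is a simplified illustration: the four status names come from the description above, whereas the function names and the plain substring matching are assumptions and do not reflect the actual implementation.

```python
from enum import Enum


class Status(Enum):
    UNMARKED = "unmarked"
    MARKED = "marked"        # suspected confidentiality communication
    CONFIRMED = "confirmed"  # assessor confirmed it is confidential
    REJECTED = "rejected"    # assessor decided it is not confidential


# Only these statuses may be requested by investigators.
VISIBLE_TO_INVESTIGATORS = {Status.UNMARKED, Status.REJECTED}


def initial_status(text: str, keywords: list) -> Status:
    """Mark a trace if it hits any keyword supplied by the defence lawyer."""
    lowered = text.lower()
    hit = any(keyword.lower() in lowered for keyword in keywords)
    return Status.MARKED if hit else Status.UNMARKED


def assess(status: Status, is_confidential: bool) -> Status:
    """Decision of the independent assessor on a marked trace."""
    if status is not Status.MARKED:
        raise ValueError("only marked traces are assessed")
    return Status.CONFIRMED if is_confidential else Status.REJECTED


def visible_to_investigator(status: Status) -> bool:
    return status in VISIBLE_TO_INVESTIGATORS


status = initial_status("Mail from the law firm about the hearing", ["law firm"])
print(status.value, visible_to_investigator(status))  # marked False
status = assess(status, is_confidential=False)
print(status.value, visible_to_investigator(status))  # rejected True
```

A trace marked manually by an investigator would enter the same flow by being set to the marked status directly.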
Having access to and being able to analyse large amount of data are crucial in the process of truth-finding. For analysing such large volume of forensic data within a reasonable time, there is no less intrusive way than using Hansken. Therefore, it can be judged that the means are proportionate to the legitimate aim pursued. Furthermore, the processing purposes cannot be achieved if fewer data were processed in Hansken. For finding the truth in (criminal) cases it is of great importance that not too little data are collected, because precisely as complete a picture as possible has to be created of a situation/suspect, also to relieve the suspect. A relevant research question can be for example: Does X appear in the file? Such a question can only be answered by searching through all available material. All data processed in Hansken are necessary for achieving the goal, precisely because links are sought in a process that can serve as evidence. It is not always possible to determine in advance which/what type of data and which person is involved. In that sense, "data minimization" cannot be met.
The NFI has been accredited by the Dutch Accreditation Council in a number of fields based on EN ISO 15189:2012, EN ISO/IEC 17025:2005, EN ISO/IEC 17020:2012 (Dutch Accreditation Council, 2019) and complies with The National Government Information Security Baseline 2017 (Dutch: Baseline Informatiebeveiliging Rijksdienst (BIR)) according to the "comply or explain" principle.

Assessment of the risks to the rights and freedoms of data subjects
The privacy risks specified in this section are in line with CNIL PIA Methodology and ISO 29134 (CNIL, 2017; ISO, 2017).

Illegitimate access to data
Illegitimate access may lead to considerable damages to the data subjects due to the amount and privacy sensitivity of the data such as discrimination, damage to reputation, financial loss etc. Further, any unauthorised disclosure may have a negative influence on the discovery of truth in criminal investigations. For instance, a situation where a suspect finds out by means of an unauthorized access that he/she is under secret investigation may frustrate the investigation.
Some main threats that can lead to illegitimate access are as follows: (1) Data processed/read for the wrong case. (2) Unencrypted data transmission from third parties. (3) Access to the big data forensic platform by an unauthorized person. (4) Investigation report (paper documents) sent to the wrong destination. (5) Access to data after the case is closed. (6) No systematic monitoring of authorizations. (7) Illegitimate cross-referencing of data (ISO, 2017).

Unwanted change of data
Unwanted change of data may cause the big data forensic platform to fail to operate correctly. Also, the processing could be misused for evidence manipulation; a piece of evidence might be altered in other valid data such as data about location or movements, economic situation, etc. As a result, there might be occasions where someone commits a crime and an innocent person is accused of it and personal data are processed in a manner that is incompatible with specified and legitimate purposes.
Some main threats that can lead to unwanted change of data are as follows: (1) Errors during updates, configuration or maintenance (ISO, 2017). (2) Malicious code injection (CNIL, 2017).

Disappearance of data
Disappearance of data may dramatically affect data availability. In digital investigations, data availability should be high, preferably 24/7; otherwise the amount of evidence found in digital material will decrease and the time needed to solve a case will increase (van Beek et al., 2015). Like unwanted change of data, data disappearance may cause the platform to malfunction. In consequence, the outcomes of these feared events may produce adverse legal effects concerning the data subject.
Some main threats that can lead to disappearance of data are as follows: (1)

The measures envisaged to address the risks
In this section, we discuss appropriate measures for a big data forensic platform, as summarised in Table 2, to address the privacy risks identified in the previous section.
1. Access to the platform should be permitted only to personnel who possess a security clearance.
2. If an investigator is no longer working on a case, his/her access to the case data should be immediately withdrawn.
3. LEAs should impose strict data retention periods in accordance with the requirements of all applicable legislation. If longer retention periods are needed, the data should be anonymized. Specifications on how to destroy the personal data in a secure manner should also be developed. Procedural measures have to be in place to ensure that retention and destruction policies are respected.
4. Case data and queries made by investigators for searching traces should be kept encrypted.
5. Software used in the platform, especially forensic analysis tools, should be analysed with regard to privacy.
6. Confidential communication (e.g., information covered by legal professional privilege) should be excluded as evidence for criminal prosecution. Unfortunately, no forensically sound procedure exists that can guarantee the protection of confidential communication, and current practices require manual intervention (Jiang et al., 2013). Hence, such information should, as far as possible, be protected from disclosure. To this end, a list of keywords to filter out the related privileged data might be specified in advance. The defence lawyer might be consulted when specifying the possible keywords. Searches for these keywords should only be allowed with prior justification (Naudts, 2018). It is possible that traces are not correctly recognized as privileged and are not filtered out during the investigation. In this case, an investigator might mark the traces as privileged and filter them out immediately. Likewise, it is possible that some data are incorrectly classified as privileged. In this case, an officer who is not involved in the investigation may be designated to restore the non-privileged data.
7. To reduce the risk impact, it should be ensured that any disclosure of personal data is detected as soon as possible and that the supervisory authority (e.g., PPS or DPA) is informed accordingly.
8. Datasets used for training and software testing should be anonymized.
9. Strict access control policies should be in place to limit the risk of unauthorised access to personal data.
10. All user and system actions should be logged and attributed, both to diagnose privacy breaches and to preserve the chain of evidence. Article 25 of the Police Directive sets out how detailed the logs should be: the justification, date and time of such operations and, as far as possible, the identification of the person who consulted or disclosed personal data, and the identity of the recipients of such personal data. The said article also limits the usage of logs. To protect the privacy of employees, logs might be pseudonymised too.
11. To implement the principle of purpose limitation, the big data forensic platform should keep an explicit audit trail of user actions. Audit trails may verify that the actions of the forensic investigator are within the scope of a warrant/court order and also reinforce the reliability of the evidence (Adams, 2008). To achieve a higher level of auditing, unique user profiles may be provided which are tied to individual analysts and allow for varying degrees of access to data based on the tasks assigned and the clearance given (Marquenie and Coudert, 2017).
12. It is important that the case data (seized digital material) are handled meticulously and carefully throughout the entire investigation: both when collecting data from third parties and when transferring it to the platform. The original digital material should be kept outside of the platform's processing; an exact copy should be made and used for searching traces. To monitor the integrity of the copy and minimize errors, a hash function and a message authentication code (MAC) could be used.
13. To ensure data accuracy, the data must be categorized according to its reliability. For this purpose, 4x4x4 (Belgium) or 5x5x5 (UK) grid structures could be used as an example. For instance, in the UK's structure (College of Policing, 2019), the data are described in 5 categories as follows: (A) Known directly to the source, (B) Known indirectly to the source but corroborated, (C) Known indirectly to the source, (D) Not known, (E) Suspected to be false (Marquenie and Coudert, 2017). By the same token, it is desirable to clearly distinguish primary data sources (the sources where data are actually generated) from secondary ones (sources that link existing data sets and (re)use them) (JenV, 2018).
14. A distinction between different categories of data subjects should be drawn. The personal data of convicts, suspects, victims, witnesses and third parties should be treated differently, as addressed in Article 6 of the Police Directive. For instance, different degrees of anonymization (e.g., irreversible or de-identified (reversible)) may be used for different categories of data subjects.
15. The potential discriminatory factors (e.g., the training data, the learning algorithm etc.) should be tested to determine whether they create biases against natural persons (especially for profiling). If so, they shall be prohibited.
16. Algorithms and techniques which are scientifically derived and proven should be used in the platform. The margin of error associated with them should be determined by taking into account their potential impact on natural persons (JenV, 2018).
17. Competent authorities could be encouraged to publish their PIA reports (executive summary) on a publicly available platform.
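As an illustration of measures 10 and 12, the following Python sketch shows one possible way to (a) check that a working copy of seized material still matches the original by means of a hash and a MAC, and (b) build a log entry carrying the items listed in Article 25 with a pseudonymised user identifier. It is a minimal sketch under stated assumptions, not the platform's actual implementation: the choice of SHA-256 and HMAC-SHA-256, the use of a keyed hash for pseudonymisation, and all function and field names are hypothetical and do not come from the Police Directive or from the case study.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Optional


def sha256_digest(data: bytes) -> str:
    """Unkeyed hash of the evidence; detects accidental modification."""
    return hashlib.sha256(data).hexdigest()


def mac_tag(key: bytes, data: bytes) -> str:
    """Keyed MAC (HMAC-SHA-256); only holders of the key can recompute it."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_copy(key: bytes, copy: bytes, expected_digest: str, expected_tag: str) -> bool:
    """Return True only if the copy matches both the recorded hash and the MAC."""
    digest_ok = hmac.compare_digest(sha256_digest(copy), expected_digest)
    tag_ok = hmac.compare_digest(mac_tag(key, copy), expected_tag)
    return digest_ok and tag_ok


def pseudonymise(user_id: str, pseudonym_key: bytes) -> str:
    """Assumed scheme: replace the user identifier by a keyed hash.

    The same user always maps to the same pseudonym, so log entries stay
    attributable, while the real identifier is not stored in the log.
    """
    return hmac.new(pseudonym_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def log_entry(user_id: str, operation: str, justification: str,
              recipient: Optional[str], pseudonym_key: bytes) -> dict:
    """Build one log record with the items named in Article 25 (hypothetical field names)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # date and time
        "operation": operation,                               # e.g. consultation, disclosure
        "justification": justification,                       # why the operation took place
        "person": pseudonymise(user_id, pseudonym_key),       # who performed it (pseudonymised)
        "recipient": recipient,                               # recipient of the data, if any
    }


if __name__ == "__main__":
    mac_key = b"example-mac-key"          # illustrative only; a real key must be kept secret
    original = b"bytes of the seized image"
    digest, tag = sha256_digest(original), mac_tag(mac_key, original)
    print(verify_copy(mac_key, original, digest, tag))
    print(log_entry("investigator-1", "consultation", "court order", None, b"example-log-key"))
```

In such a design the hash and the MAC would be recorded at acquisition time and re-verified before each analysis run. The pseudonymisation key would be held separately from the logs, so that re-identifying an employee requires a deliberate, authorised step.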
4.3.1. Discussion of principles relating to processing of personal data in law enforcement sector
The processing of personal data in a big data forensic platform is not transparent. Data subjects such as suspects do not always know if and how their personal data are being processed, so they cannot be asked to give their consent to the processing. They also lack insight into the accuracy and completeness of their personal data. The personal data of persons who have nothing to do with the criminal investigation might also be processed in such a platform. These persons are not able to make use of their rights under the Police Directive. It is in practice impracticable for the competent authorities to be transparent about the processing of natural persons' data, as this involves a disproportionate amount of time and effort and may frustrate the process of truth-finding. Since it is difficult to determine in advance which information is relevant to the case and which is not, data minimization might not be achievable in criminal investigations.
Because the persons involved are often unable to exercise some of their rights, it is of great importance that the processing in a big data forensic platform is subject to external control/audit by the supervisory authority. Additionally, the competent authorities might provide a clear explanation of how personal data are used in forensic investigations within the big data platform on their website and/or social media account. In this context, an explanation of how to challenge decisions made with the help of the platform may also be put on the website for future reference.

Conclusion
In this paper, we described how to conduct a PIA on a big data forensic platform. To this end, we compared several PIA methods and selected the three that best suit our requirements, since no PIA methodology exists that conforms to the Police Directive and focuses on law enforcement activities.
Firstly, our study demonstrates the importance of conducting a PIA for all forensic platforms. Seized digital material may contain large amounts of common and sensitive personal data of everyone involved in a crime. Hence, processing for forensic purposes is more likely to result in an interference with the fundamental rights of data subjects. PIAs can help to minimize this interference. Secondly, the findings from this study strengthen the position that privacy correlates with security. Necessary measures should also be taken to ensure the security of such forensic platforms.
This case study reveals that threats in the police sector that can lead to privacy risks are rather different from those in other sectors. Collaboration between the investigators and PIA practitioners is crucial in precisely specifying these threats. The implementation of a PIA encourages privacy awareness among the investigators and the developers of a big data forensic platform. It is worth noting that PIAs should be carried out before the development of such platforms. Whenever a shift in privacy objectives takes place during the design phase, LEAs should repeat the PIA to address new privacy risks.
In future work, we plan to further investigate how to improve the implementation of PbD in large-scale digital forensic investigations without reducing the effectiveness and the speed of investigations.
through Networked Technologies, Information policy And Law (Essential) Project funded under the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie [grant agreement no 722482].