A method of record linkage.

In cancer epidemiology, prospective approaches are very important both in testing etiological hypotheses and in evaluating preventive procedures. Prospective studies, however, are very difficult and expensive, because a large number of people and a long period of observation are necessary for a satisfactory study. As a data source for follow-up studies, a population-based cancer registry is very useful. The Osaka Cancer Registry has been in operation since December, 1962. Since 1968 the data processing, including the work of collation, has been semicomputerized. In order to identify cancer patients, we use the following six indices: date of birth; first Chinese character of a person's family name; address a (city, ward, town or village); address b (further details, i.e., street, avenue, section, hamlet, etc.); site; and sex. When we have data on the collation indices for the subjects to be followed up, we can conduct follow-up studies easily and accurately, using a semicomputerized collation method similar to that in the cancer registration system. Because the master file of the Osaka Cancer Registry contains the data of cancer cases reported and all cancer deaths among the residents of Osaka Prefecture, we can follow up the subjects living in Osaka Prefecture and obtain data about virtually all cancer incidences and deaths among them. In this follow-up method by means of record linkage to the cancer registry, the following factors should be taken into account: coverage of cancer data in the Osaka Cancer Registry, reliability of the collation method, and address of the subjects to be followed up. As an example of a study with this method, we present the follow-up study of the screenees of a mass screening program for stomach cancer.


Introduction
In cancer epidemiology, prospective approaches are very important both in testing etiological hypotheses and in evaluating preventive procedures (1). Prospective studies, however, are very difficult and expensive, because a large number of people and a long period of observation are necessary for a satisfactory study. For a follow-up study in Japan we have three kinds of data sources, the systems of which are routinely maintained for other purposes.
First, there is a system of official family records, the Koseki, containing the Honseki, i.e., the legal address. By knowing the recent Koseki record of a person, we can obtain information on his survival status: whether he is alive or dead. But it takes a great deal of work to carry out an inquiry into Koseki records one by one for a large number of people, because the Koseki system has not yet been computerized.
Second, the health centers in Japan routinely collect and keep a copy of the death certificate for every dead person whose address comes within their jurisdiction. These data are useful for a follow-up study. Since 1969 the data from death certificates have been transferred onto magnetic tapes with the optical mark reader in the Health and Welfare Statistics, Information Department of the Ministry of Health and Welfare. Unfortunately, however, the name of a dead person is not included in the items for input into the tape, and we have no identification number in Japan, such as the National Health Service number in the United Kingdom and the Social Security number in the United States. As a result, it is impossible to conduct follow-up studies by collating the file of subjects to be followed up with the mortality data file by computer. It takes a great deal of work to follow up a large number of people one by one, using the copies of the death certificates kept in the health centers.
As a third data source for a follow-up study we have a cancer registry system. As of 1977, eleven prefectural cancer registries are in active operation. Though most of the registries have very limited capacities because of their insufficient budget and staffing, some registries are now capable of conducting various follow-up studies using their registered data.
In this paper, we introduce, first, the cancer registration system in Osaka, and, then, the follow-up method by means of semicomputerized record linkage to the cancer registry. This follow-up system has recently been set up by the Osaka Cancer Registry (2).

Cancer Registration System in Osaka
Outline of the Osaka Cancer Registry. Cancer registration in Osaka (population: 8.3 million as of 1975) has been in operation since December, 1962. The objectives of registration are as follows: (1) to provide information concerning the size and nature of the cancer problem in Osaka and to present data for cancer epidemiology; (2) to provide information concerning the medical care and prognosis of cancer patients and to assist in planning cancer control projects and in evaluating their effectiveness; and (3) to return its statistics and information about each patient's prognosis to each participating hospital and clinic and to assist in raising the level of their medical care for cancer patients.
[...] sorting the two files to be collated in order of the patient's date of birth, and then by sorting them again in order of the patient's address. But because of the increasing number of cases registered, the collation work has become too large for a small registry staff.
In 1968 the registry decided to computerize all tasks, including the collation work. Figure 2 shows the computer system flow chart for collation, now in operation. In Step 1, the records in the new report file are collated against each other to check for duplication; in Step 2, the new report file, in which duplication has been checked in Step 1, is collated with the master file, and these two files are then combined into a new master file; in Step 3, a new cancer death file is collated with the master file in order to check the survival status of the previously registered cases and to identify the cases registered by death certificate only.
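The three-step flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the dictionary record layout, and the `is_candidate` rule are hypothetical, Step 3 is assumed to run against the updated master file, and the manual confirmation that follows each computer listing is not modeled.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Callable[[Record, Record], bool]
Pairs = List[Tuple[int, int]]


def collate(left: List[Record], right: List[Record], is_candidate: Rule) -> Pairs:
    """List index pairs (i, j) that the rule flags as possible matches."""
    return [(i, j)
            for i, a in enumerate(left)
            for j, b in enumerate(right)
            if is_candidate(a, b)]


def run_collation_cycle(new_reports: List[Record], master: List[Record],
                        new_deaths: List[Record], is_candidate: Rule):
    # Step 1: collate the new reports against each other to find duplicates.
    step1 = [(i, j)
             for i in range(len(new_reports))
             for j in range(i + 1, len(new_reports))
             if is_candidate(new_reports[i], new_reports[j])]
    # Step 2: collate the new reports with the master file, then combine.
    # (Merging of manually confirmed duplicates is omitted in this sketch.)
    step2 = collate(new_reports, master, is_candidate)
    new_master = master + new_reports
    # Step 3: collate the new cancer death file with the master file
    # (assumed here to be the updated one).
    step3 = collate(new_deaths, new_master, is_candidate)
    return step1, step2, step3, new_master
```

Every pair returned by a step corresponds to a line on the printed list that would still need a manual check against the original cards.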
In Osaka, there are five medical schools, 456 hospitals, and 6077 private clinics; 11,423 physicians were working as of 1975. With so many participants in the cancer registration, there is a risk that incorrect information will be entered for the indices submitted for collation. Thus, we decided to use combinations of several indices for identification of cancer cases. The following six indices are used: (1) date of birth, (2) first Chinese character of a person's family name, (3) address a (city, ward, town or village), (4) address b (further details, i.e., street, avenue, section, hamlet), (5) collation site, and (6) sex.
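As a minimal sketch of how agreement or disagreement on the six indices might be represented, the following code compares two records index by index. The class name, field names, and sample patterns are hypothetical; in particular, `ILLUSTRATIVE_PATTERNS` is not the registry's Table 2, whose 29 combinations are not reproduced in the text.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class CaseRecord:
    birth_date: Optional[str]   # (1) date of birth
    name_char: Optional[str]    # (2) first character of family name, romanized
    address_a: Optional[str]    # (3) city, ward, town or village
    address_b: Optional[str]    # (4) street, avenue, section, hamlet
    site: Optional[str]         # (5) collation site code
    sex: Optional[str]          # (6) sex


def agreement_pattern(a: CaseRecord, b: CaseRecord) -> Tuple[Optional[bool], ...]:
    """Per index: True if both agree, False if they differ, None if either is missing."""
    pattern = []
    for f in fields(CaseRecord):
        x, y = getattr(a, f.name), getattr(b, f.name)
        pattern.append(None if x is None or y is None else x == y)
    return tuple(pattern)


# Hypothetical examples only; the real combinations were chosen empirically.
ILLUSTRATIVE_PATTERNS = {
    (True, True, True, True, True, True): "all six indices agree",
    (True, True, True, False, True, True): "all agree except address b",
}


def is_possible_pair(a: CaseRecord, b: CaseRecord) -> bool:
    """Flag a pair for manual review if its pattern is in the lookup table."""
    return agreement_pattern(a, b) in ILLUSTRATIVE_PATTERNS
```

Keeping "missing" distinct from "disagree" matters here, because reports submitted on postcards may omit an index entirely.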
In Osaka, a postcard has been used as the cancer report form. Doctors sometimes write only a part of the patient's name for reasons of confidentiality. Therefore, only the first Chinese character of the patient's family name can be used as an identification index. Each Chinese character is assigned one particular reading and encoded in letters of the Roman alphabet. A collation site code is added automatically by the computer to every case according to its primary site code. For example, all cases whose primary site codes are 180-182 or 2340 (I.C.D.) are given 180 as the collation site code.
Table 2 shows the 29 combinations of agreement or disagreement of the six indices, which were selected empirically as those of possible matching pairs. All pairs falling under these combinations are printed out on the computer list. The detailed flow chart of collation is shown in Figure 3. All the possible pairs printed out by the computer are checked manually against the original cards in order to determine whether or not the two records belong to the same patient.
Table 3 shows, by combination number, the number of pairs printed out and the number of cases confirmed manually when we checked the matching efficiency using the data for 1968. From this result, we decided to discard thereafter combinations 21, 23, 24, and 25, and to add conditions for some combination numbers. The reason for this is that the date of birth is divided into four parts: (1) the era name of the emperor when the patient was born, (2) year of birth according to the Japanese calendar, (3) month, and (4) date. For combination numbers 11, 20, and 22, the possible matching pairs are to be printed out only when three of the four factors of their birth dates agree.
Pairs of combination numbers 11, 20, and 22 are to be printed out when the birth date or address b of one card is not given. For combination number 26, pairs are to be printed out only when the name of one card is not given.
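Two of the rules above lend themselves to a short sketch: the assignment of a collation site code and the condition that three of the four birth-date parts agree. The function and type names are hypothetical, codes are assumed to be stored as strings, and only the one grouping stated in the text (180-182 and 2340 mapped to 180) is encoded; all other codes are passed through unchanged as a placeholder.

```python
from typing import NamedTuple


class JapaneseBirthDate(NamedTuple):
    era: str     # era name of the emperor, e.g. "Showa"
    year: int    # year within the era
    month: int
    day: int


def collation_site_code(primary_site: str) -> str:
    """Map a primary site code to its collation site code."""
    if primary_site in {"180", "181", "182", "2340"}:
        return "180"
    # Placeholder: the registry's full grouping table is not given in the text.
    return primary_site


def parts_agreeing(a: JapaneseBirthDate, b: JapaneseBirthDate) -> int:
    """Count how many of the four birth-date parts are identical."""
    return sum(x == y for x, y in zip(a, b))


def passes_birth_date_condition(a: JapaneseBirthDate, b: JapaneseBirthDate) -> bool:
    """Assumed reading of the added condition: at least three of four parts agree."""
    return parts_agreeing(a, b) >= 3
```

Under this reading, a pair whose birth dates differ only in the day (for example a transposed "12" and "21") would still be listed, while a pair differing in both era and year would not.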
As a result of these modifications, the number of pairs printed out on the list in Step 3 with the 1968 data decreased to 39.5% of those on the previous computer list, while the number of confirmed cases remained at 98.9% of the number before the modifications (Table 4).
In order to check the reliability, both the manual method of collation and the computer method were used in the collation of the cancer reports diagnosed in 1968 with the cancer death cards of 1968. As shown in Table 5, 2710 matching pairs were confirmed by either method. The computer method overlooked 2.4% of all the matching pairs, a little higher than the percentage overlooked by the manual method (1.2%). Pairs for which no combination number was given were not printed out on the list. But because of the increased volume of data to be collated, the manual method is now impracticable for a small registry staff. Instead, with the semicomputerized method described above, it has become possible to maintain a high degree of reliability for all collations even when a large volume of data is to be collated.
Using the semicomputerized collation method similar to that in the cancer registration system, we can conduct various follow-up studies when we have data on the collation indices for the subjects to be followed up. Figure 4 shows the follow-up method by means of record linkage to the Osaka Cancer Registry. Because the master file of the registry contains the data of cancer cases reported and all cancer deaths among the residents of Osaka Prefecture, we can follow up all the subjects living in Osaka Prefecture and obtain data about virtually all cancer incidences and deaths occurring among them.
By this method we can conduct follow-up studies for a large number of subjects easily and accurately for a long period, but the following factors should be kept in mind.

Coverage of Cancer Data in the Osaka Cancer Registry. By this method we cannot obtain data about cancer incidence or death for persons who were not reported to the registry, even if they suffered from cancer or died of it. The coverage of cancer incidence data in the registry depends on the degree of participation in the registration by each hospital or physician. In 1966, we estimated that the coverage of cancer incident cases in the registry was about 90%. As for cancer deaths, we can collect all data from the Department of Health of the Osaka Prefectural Government.
Reliability of the Collation Method. As described above, the percentage of pairs that are overlooked by this semicomputerized method is about 2%.
Address of the Subjects to Be Followed Up. The Osaka Cancer Registry contains the data of cancer incidences or deaths occurring only among the residents of Osaka Prefecture. Therefore, we can follow up by this method only those subjects whose addresses fall within the prefecture. When the subjects move outside the prefecture during the follow-up period, we cannot obtain cancer data about them and they are regarded as being free of cancer.
In analyzing data obtained from this follow-up study, special consideration should be given to these factors.

Follow-Up Study of the Screenees of a Mass Screening Program for Stomach Cancer
As an example of a follow-up study by means of record linkage to the cancer registry, we describe in this section a follow-up study of the screenees of a mass screening program for stomach cancer.
In evaluating the screening program, it is necessary to follow up all the screenees and to show the decrease in mortality from stomach cancer among them (3). But the actual follow-up of a large number of screenees requires a great deal of work. Instead, we followed up the screenees by means of record linkage to the Osaka Cancer Registry.
The study subjects are the 32,789 screenees of the stomach cancer screening tests conducted by the Center for Adult Diseases, Osaka between 1967 and 1970. Those who lived outside Osaka Prefecture or whose stomach had been resected before the initial screening are excluded from the study subjects. The screening was carried out with a specially designed mobile unit equipped with a photofluorographic apparatus. As shown in Table 6, about 20% of the total [...] "early" stomach cancer cases.
[Table 5, residual entry and note: Overlooked by the manual method: 32 pairs (1.2%). These data are the results of collation of the cancer reports diagnosed in 1968 with the cancer death cards of 1968. Pairs overlooked by the computer method are due to (a) combination numbers not given, (b) disuse of certain combination numbers, and (c) addition of certain conditions to certain combination numbers.]
By linking the file of the screenees with that of the Osaka Cancer Registry, we obtained data about stomach cancer cases and deaths after the initial screening. The follow-up period was from the initial screening through December 31, 1975, and the average follow-up period was 6.1 years. Table 7 shows the number of stomach cancer cases diagnosed during the follow-up period. In addition to the 123 stomach cancer cases detected at the initial screening, a further 212 cases were diagnosed after the initial screening. Seventy cases were detected by the repeat screening and the remaining 142 cases were diagnosed outside the screening. The data for most cases diagnosed outside the screening were obtained for the first time by the record linkage to the cancer registry. The percentage of screenees who were screened only once during the follow-up period was 45.2% in males and 62.2% in females. The average number of screenings per screenee during the follow-up period was 2.66 in males and 1.92 in females.
The number of stomach cancer deaths among the study subjects during the follow-up period is shown in Table 8. There were 63 deaths among the 123 stomach cancer cases detected at the initial screening, 25 deaths among the 70 cases detected at the repeat screening, and 91 deaths among the 142 cases not diagnosed by screening. In total, 179 stomach cancer deaths were observed among the study subjects during the follow-up period.
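The case and death counts quoted above can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the proportions computed at the end are crude ratios of deaths to cases within the follow-up period, not survival rates; they take no account of differing lengths of follow-up.

```python
# Counts as stated in the text, keyed by how the case was detected.
cases = {"initial": 123, "repeat": 70, "outside": 142}
deaths = {"initial": 63, "repeat": 25, "outside": 91}

# Cases diagnosed after the initial screening (text: 212).
cases_after_initial = cases["repeat"] + cases["outside"]

# All stomach cancer cases and deaths among the study subjects.
total_cases = sum(cases.values())
total_deaths = sum(deaths.values())  # text: 179

# Crude proportion of cases that died during follow-up, by detection mode.
crude_death_proportion = {k: deaths[k] / cases[k] for k in cases}

for mode, p in crude_death_proportion.items():
    print(f"{mode}: {deaths[mode]}/{cases[mode]} = {p:.1%}")
```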
In order to obtain the expected number of deaths for comparison with the observed number of deaths, the person-years observed during the follow-up period were calculated by sex and age-group. In this calculation, consideration was given to such factors as aging and dropping out from observation due to death and moving out of Osaka Prefecture. The probability of death was obtained from the life table calculated on the basis of the 1970 mortality statistics. The rate of moving out of Osaka Prefecture was estimated from the 2% sample actually followed up, where the mail survey and Koseki survey were conducted. Then, the age-specific mortality rate of stomach cancer in the general population of Osaka Prefecture was applied to the person-years of each age-group. The stomach cancer mortality rates used were the average of those for 1969-1971 and those for 1972-1974, obtained from the Osaka Cancer Registry. We also calculated the expected numbers of stomach cancer cases, deaths from cancer of digestive organs, and deaths from cancer of all sites by similar methods and compared them with the observed numbers in order to examine whether there are any biases in the study subjects.
Table 9 shows the comparison of the observed number with the expected one when the study subjects are divided by age-groups at the initial screening. As for those aged 40 to 59 years, the O/E ratio in stomach cancer death was 0.72 in males, 0.79 in females and 0.74 in total. Namely, there was about a 20 to 30% decrease in mortality from stomach cancer in this age-group. This definitely demonstrates the effect of the screening program. The O/E ratio of stomach cancer cases was 1.25 in males, 1.57 in females and 1.35 in total. When only advanced stomach cancer cases were considered, the ratio was 0.94 in males, 1.20 in females and 1.02 in total. These figures show that there were about the same number of stomach cancer cases in the study subjects as in the general population. But the O/E ratio in death from cancers of digestive organs and in deaths from cancers of all sites shows that there might be some biases in the study subjects aged 40 to 59.
Table 9 also shows the O/E ratios for stomach cancer deaths and cases among the study subjects of other age-groups. Though the observed numbers are small, the tendency can be seen for the effect of the screening program in reducing stomach cancer deaths to be more pronounced in younger age-groups than in older ones.
We are planning to continue this follow-up study for a fairly long period in the hope that the knowledge from it will be useful both in elucidating the natural history of stomach cancer and in answering practical questions such as the optimal frequency and interval for the screening program.
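The expected-number calculation described in this section, in which age-specific rates from the general population are applied to the person-years observed, can be sketched as follows. The function names and all numbers in the usage example are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from Table 9. The adjustments for aging, death, and moving out of the prefecture are assumed to have been applied already when the person-years were computed.

```python
from typing import Dict


def expected_deaths(person_years: Dict[str, float],
                    rate_per_100k: Dict[str, float]) -> float:
    """Sum over age-groups of person-years times the age-specific mortality rate."""
    return sum(person_years[age] * rate_per_100k[age] / 100_000
               for age in person_years)


def oe_ratio(observed: float, expected: float) -> float:
    """Observed-to-expected ratio; values below 1 mean fewer deaths than expected."""
    if expected <= 0:
        raise ValueError("expected number must be positive")
    return observed / expected


# Hypothetical illustration (not the study's data).
py = {"40-49": 10_000.0, "50-59": 20_000.0}   # person-years by age-group
rates = {"40-49": 50.0, "50-59": 100.0}       # deaths per 100,000 person-years
exp = expected_deaths(py, rates)              # 5 + 20 = 25 expected deaths
ratio = oe_ratio(18, exp)                     # 18 observed / 25 expected
```

In practice the calculation would be repeated separately for each sex, and for each end point (stomach cancer deaths, stomach cancer cases, deaths from cancer of digestive organs, deaths from cancer of all sites).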

Summary
In this paper, we have introduced, first, the cancer registration system in Osaka, and then, the follow-up method by means of record linkage to the cancer registry.
As an example of a study with this method, we have next presented the follow-up study of the screenees of a mass screening program for stomach cancer.
By this method, we are now able to conduct various prospective studies, concurrent or nonconcurrent, in order to set up or to test etiological hypotheses of cancer. We are planning to collect various lists of the population at risk, where the identification indices for collation are included.