A multi-centric dataset on patient-individual pathological lymph node involvement in head and neck squamous cell carcinoma

Dataset: We provide a dataset on lymph node metastases in 968 patients with newly diagnosed head and neck squamous cell carcinoma (HNSCC). All patients received neck dissection and we report the number of metastatic versus investigated lymph nodes per lymph node level (LNL) for every individual patient. Additionally, clinicopathological factors including T-category, primary tumor subsite (ICD-O-3 code), age, and sex are reported for all patients. The data is provided as three datasets: Dataset 1 contains 373 HNSCC patients treated at Centre Léon Bérard (CLB), France, with primary tumor location in the oral cavity, oropharynx, hypopharynx, and larynx. Dataset 2 contains 332 HNSCC patients treated at the Inselspital, Bern University Hospital (ISB), Switzerland, with primary tumor location in the oral cavity, oropharynx, hypopharynx, and larynx. For these patients, additional information is provided including lateralization of the primary tumor, size and location of the largest metastases, and clinical involvement based on computed tomography (CT), magnetic resonance imaging (MRI), and/or 18FDG-positron emission tomography (PET/CT) imaging. Dataset 3 consists of 263 oropharyngeal SCC patients underlying a previous publication by Bauwens et al. [1], who were treated at CLB. For these patients, additional information including HPV status, lateralization of the primary tumor, and clinically diagnosed lymph node involvement is provided.

Reuse Potential: The data may be used to quantify the probability of occult lymph node metastases in each LNL, depending on an individual patient's characteristics of the primary tumor and the location of clinically diagnosed lymph node metastases. As such, the data may contribute to further personalize the elective treatment of the neck for HNSCC patients, i.e., the definition of the elective clinical target volume (CTV-N) in radiotherapy (RT) and the extent of neck dissection (ND) in surgery. There exists only one similar publicly available dataset that reports clinical involvement per LNL in 287 oropharyngeal SCC patients [2]. The data presented in this article substantially extends the available data: it additionally includes pathologically assessed involvement per LNL and provides data for multiple subsites in the head and neck region.

© 2023 The Author(s

Background
Treatment of HNSCC patients currently includes elective RT or prophylactic dissection of large parts of the soft neck tissue, which is at risk of harboring occult lymph node metastases that are not clinically/radiologically detectable [3]. Current clinical guidelines on elective nodal RT and neck dissection are mostly based on the prevalence of lymph node metastases in a lymph node level for a given primary tumor location. The overarching goal of this research is to better quantify the risk of occult metastases in each lymph node level based on an individual patient's clinically/radiologically diagnosed state of disease. This may lead to further personalization of neck dissection procedures and the definition of the nodal clinical target volume in RT, also considering the location of clinically/radiologically detected lymph node metastases and characteristics of the primary tumor such as T-category and lateralization of the primary tumor. Detailed datasets reporting lymph node involvement together with clinicopathological factors on a patient-individual level are the basis and a necessary requirement for achieving this goal. This publication presents the largest publicly available such dataset. It can be interactively explored and visualized via the previously developed platform https://lyprox.org.

Value of the Data
• The dataset containing 968 patients represents a substantial addition to a previously published dataset of 287 patients with SCC in the oropharynx [2]. It adds patients with primary tumors located in the oral cavity, hypopharynx and larynx.
• Since pathology after neck dissection is the gold standard for investigating whether occult disease was present in a lymph node, the data is of particularly high quality and an essential addition to the previous dataset [2] reporting only clinical involvement. For parts of the data, both clinical and pathological involvement is provided, containing information on sensitivity and specificity of clinical detection of lymph node metastases.
• Researchers and clinicians working in the field of head and neck cancer who are interested in the lymphatic spread of the disease and how to manage its related risks may benefit from the publication of these data. The data is the basis for further personalization of neck dissection procedures and elective nodal RT.
• Ultimately, HNSCC patients may benefit from further personalized treatments that better balance the risk of treatment side effects versus the risk of tumor recurrence.
• The data may be used to quantify the probability of occult lymph node metastases in each LNL, depending on an individual patient's characteristics of the primary tumor and the location of clinically diagnosed lymph node metastases.
• Similar to the dataset by Ludwig et al. [2], the three cohorts presented in this work may allow researchers to build and further develop predictive statistical models [4,5] that estimate the risk for occult disease of a patient based on their clinical diagnosis.
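For the parts of the data that report both clinical and pathological involvement, the sensitivity and specificity mentioned above could be estimated along the following lines. This is a minimal sketch: the function name and the toy values are hypothetical, and pathology is treated as the reference standard, as stated in the text.

```python
def sensitivity_specificity(clinical, pathological):
    """Return (sensitivity, specificity) of clinical vs. pathological findings.

    Both arguments are equal-length sequences with one entry per
    (patient, LNL) pair: True = involved, False = healthy, None = unknown.
    Pairs with an unknown value on either side are skipped.
    """
    tp = fp = tn = fn = 0
    for c, p in zip(clinical, pathological):
        if c is None or p is None:
            continue
        if p and c:
            tp += 1      # metastasis present and detected clinically
        elif p and not c:
            fn += 1      # occult metastasis
        elif c:
            fp += 1      # clinically suspicious, pathologically healthy
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy values for illustration only (not taken from the datasets).
clinical = [True, False, True, False, False, None]
pathological = [True, True, False, False, False, True]
sens, spec = sensitivity_specificity(clinical, pathological)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The false negatives counted here correspond to occult disease, which is the quantity of interest for elective treatment.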

Data Description
The data is provided as three separate datasets that were collected at different institutions. Each dataset is contained in its own directory in the GitHub repository and each directory and data file is structured in the same way. However, some of the information included is specific to one dataset and not contained in the others. Due to these differences, we describe each file separately for completeness. Each dataset is also indexed on Zenodo as its own separate dataset.

2023-isb-multisite/
• data.csv: The data is provided as a comma-separated values (CSV) file containing one row for each of the 332 patients. The table has a header spanning three rows that describe the columns. Below we explain each column in the form of a list with three levels.

      2. Ib_to_III : Total number of dissected lymph nodes found to harbor metastases in the right LNLs Ib-III. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III, because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
9. enbloc_dissected : These columns only report the number of lymph nodes that were resected en-bloc. If, e.g., the LNLs II, III, and IV were resected together, then in each of the respective columns, we report the total number of jointly resected lymph nodes and add a symbol (e.g., 'a') to identify the en-bloc resection group.
   1. left : Number of en-bloc resected nodes on the left side per LNL.
      1. <LNL> : Number of lymph nodes resected together that included this level.
   2. right : En-bloc resected lymph node count for the right side of the neck.
      1. <LNL> : Indicates the number of lymph nodes in the group that included this LNL.
10. enbloc_positive : These columns are structured in the same way as under the key enbloc_dissected, but report the number of lymph nodes that were pathologically involved. Again, the number found in a particular column reports the number of metastatic lymph nodes found in the jointly resected group the respective LNL was part of. LNLs that were resected together share an appended symbol (e.g., "8a").
   1. left : Number of en-bloc resected nodes on the left side per LNL that harbored metastasis.
      1. <LNL> : Number of lymph nodes resected together and found to be involved that included this level.
   2. right : En-bloc resected lymph node metastasis count for the right side of the neck.
      1. <LNL> : Indicates the number of positive lymph nodes in the group that included this LNL.
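To illustrate how the en-bloc columns might be read, the sketch below splits an entry such as "8a" into its node count and group symbol and collects the LNLs that share a symbol. The exact cell syntax in data.csv is an assumption inferred from the "8a" example in the text; the function names and the sample row are hypothetical.

```python
import re

# Assumed cell format: an integer count followed by a letter symbol, e.g. "8a".
_ENBLOC = re.compile(r"^\s*(\d+)\s*([A-Za-z]+)\s*$")


def parse_enbloc(cell):
    """Split an en-bloc cell such as '8a' into (count, group_symbol).

    Returns None for empty cells (LNL not part of an en-bloc resection).
    """
    if cell is None or str(cell).strip() == "":
        return None
    match = _ENBLOC.match(str(cell))
    if match is None:
        raise ValueError(f"unexpected en-bloc entry: {cell!r}")
    return int(match.group(1)), match.group(2)


def group_levels(row):
    """Map each group symbol to (count, [LNLs]) for one side of one patient."""
    groups = {}
    for lnl, cell in row.items():
        parsed = parse_enbloc(cell)
        if parsed is None:
            continue
        count, symbol = parsed
        groups.setdefault(symbol, (count, []))[1].append(lnl)
    return groups


# Hypothetical row: LNLs II, III, and IV resected together with 8 nodes in total.
row = {"I": "", "II": "8a", "III": "8a", "IV": "8a", "V": ""}
groups = group_levels(row)
print(groups)
```

Because the same count is repeated in every column of a group, summing the columns directly would count the jointly resected nodes several times; grouping by symbol first avoids that.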

Data characterization
Fig. 1 illustrates what type of data is provided through this publication and its value for personalizing the risk of lymph node involvement. It shows the percentage of patients with involvement in ipsilateral LNL III for oropharyngeal and oral cavity SCC patients. Previous publications have reported the prevalence of LNL involvement for these tumor locations [6], corresponding to just the first bar in the two panels of Fig. 1. The detailed per-patient and per-level reporting in our datasets allows for stratifying patients according to different risk factors that impact the probability of level III involvement. For example, stratifying oropharynx patients based on involvement of ipsilateral level II shows that relatively few patients have metastases in level III if level II is healthy (24 out of 115 patients, 21%). For patients with metastases in level II, involvement of level III is much more common (90 out of 279 patients, 32%). For oral cavity tumors, the difference in level III involvement depending on presence of metastases in level II is even more pronounced. Similarly, patients can be stratified based on primary tumor subsite, which is illustrated for oral cavity SCC. For tumors located in the gums or cheeks, only 3 out of 91 patients have metastases in level III. In contrast, for tumors located in the tongue, 26 out of 158 patients have metastases in level III.

Figs. 3, 4, and 5 display the distribution over primary tumor subsites in the three cohorts. The datasets show a similar but not the same distribution across oral cavity and oropharynx subsites. Among oropharynx SCC, tumors located in the tonsil are the most common in all three datasets. Among oral cavity subsites, tumors in the tongue are the most common in both the ISB and CLB Multisite datasets; however, some differences in the distribution are observed. Most notably, the CLB dataset contains a much larger number of larynx SCC. Thus, the figures suggest some differences in patient referral and in the selection of patients for surgical treatment between the two centers.
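The stratification of level III involvement by level II status described for Fig. 1 can be sketched as follows. The records below are synthetic and only reproduce the oropharynx counts quoted in the text (24 of 115 and 90 of 279); the function name and the dictionary layout are hypothetical and do not reflect the actual column structure of the CSV files.

```python
from collections import defaultdict


def prevalence_by_stratum(patients, target="III", by="II"):
    """Return {stratum: (involved, total, fraction)} for the target LNL.

    Each patient is a dict mapping an LNL name to True/False/None.
    Patients with an unknown value in either LNL are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [involved, total]
    for patient in patients:
        if patient.get(target) is None or patient.get(by) is None:
            continue
        counts[patient[by]][0] += int(patient[target])
        counts[patient[by]][1] += 1
    return {key: (pos, n, pos / n) for key, (pos, n) in counts.items()}


# Synthetic records that reproduce the counts reported in the text.
patients = (
    [{"II": False, "III": True}] * 24
    + [{"II": False, "III": False}] * 91
    + [{"II": True, "III": True}] * 90
    + [{"II": True, "III": False}] * 189
)
result = prevalence_by_stratum(patients)
for level_ii_involved, (pos, n, frac) in sorted(result.items()):
    print(f"II involved={level_ii_involved}: {pos}/{n} = {frac:.0%}")
```

The same function could be applied with a different `by` key, for example a subsite label, to obtain the kind of subsite stratification shown for oral cavity SCC.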

2023 ISB multisite dataset

Patient cohort
The dataset contains 332 patients with newly diagnosed head and neck SCC who received a neck dissection at the ENT department at ISB between 2010 and 2019. Patients treated with definitive (chemo)radiotherapy to the neck are thus not included in this dataset. Other exclusion criteria are prior treatment to the neck, prior malignancy above the diaphragm, primary tumor localizations other than those defined, and skin tumors. The treatment modality was determined individually for each patient, taking into account all available information during the interdisciplinary tumor board. In this context, the extent (side, levels to be removed) of the neck dissection was also determined.

Pathological lymph node involvement
For most patients, dissected lymph node levels were sent to pathology separately. In that case, the pathologist reported the number of investigated and the number of positive lymph nodes per lymph node level. In some cases, neighboring LNLs were resected en bloc and not marked individually, such that positive lymph nodes could not be uniquely assigned to one LNL. This decision was made by the surgeon during surgery based on the clinical presentation. In many such cases, there was a multilevel involvement of a large metastasis or a conglomerate. In this case, the total number of investigated and positive lymph nodes in the jointly resected levels is reported. Usually, the pathologist also reported the largest lymph node affected by a metastasis and the status of the extracapsular extension (ECE).

Clinical lymph node involvement
Clinical involvement information is based on diagnostic imaging (CT, MRI, PET/CT) acquired during routine clinical care using standard criteria for considering a lymph node as metastatic as described in Biau et al. [3]. The analysis was performed by a specialized head and neck radiologist as part of the interdisciplinary tumor board and thus included in the dataset.

Patient and primary tumor characteristics
The clinical data and information reported in the dataset were taken from the medical records, especially from the tumor board report, which is a consolidation of the available information (medical history, clinical examination, panendoscopy report, results of radiological and pathological examinations). The tumor location, the tumor stage according to the TNM system (7th edition) [7], the relation of the tumor to the midline, HPV status, smoking and alcohol consumption are reported.

2023 CLB multisite dataset

Patient cohort
The dataset contains 373 patients with newly diagnosed head and neck SCC who received a neck dissection at the CLB between 2003 and 2018. Patients treated with definitive (chemo)radiotherapy are thus not included in this dataset. Other exclusion criteria include recurrent tumors or prior treatment of the neck possibly affecting lymphatic drainage. Oropharyngeal SCC patients included in the 2021 CLB oropharynx dataset are not included here, i.e., no patient is duplicated. The extent of neck dissection varied between patients. In almost all patients, ipsilateral levels II, III, and IV were resected; ipsilateral level I was resected in two thirds of patients. Contralateral levels II, III, and IV were resected in approximately half the patients.

Pathological lymph node involvement
Dissected lymph node levels were sent to pathology separated by level. The number of investigated and the number of positive lymph nodes per lymph node level were extracted from the pathology report. In addition, it was recorded whether or not ECE was present anywhere in the surgical specimens, but it was not recorded in which levels the ECE was located.

Clinical lymph node involvement
Clinical involvement was not recorded for this dataset. Based on patterns of care, it is assumed that all lymph node levels that were not resected were clinically node negative.
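A user who wants to make this assumption explicit when working with the data might encode it as sketched below. The function name and the dictionary layout are hypothetical, and the convention that a non-resected level appears as a missing value (None) is an assumption, not a documented property of the CSV file.

```python
def assumed_clinical_status(dissected):
    """Derive an assumed clinical status per LNL from the resection record.

    `dissected` maps an LNL name to the number of investigated nodes, or to
    None if the level was not resected (assumed convention).

    Returns False (assumed clinically negative) for non-resected levels and
    None (clinical status unknown) for resected levels.
    """
    return {
        lnl: (False if n_nodes is None else None)
        for lnl, n_nodes in dissected.items()
    }


# Hypothetical patient: levels II to IV resected, levels I and V not resected.
dissected = {"I": None, "II": 12, "III": 7, "IV": 5, "V": None}
status = assumed_clinical_status(dissected)
print(status)
```

Keeping the resected levels as unknown rather than guessing a value avoids mixing the stated assumption with information the dataset does not contain.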

Patient and primary tumor characteristics
The clinical patient information and tumor location and staging were taken from the medical records. The tumor location was reported as ICD-O-3 codes; the tumor stage according to the TNM system was based on the 7th edition [7] or 8th edition. In addition, smoking history and alcohol consumption are reported, as well as HPV status if available.

2021 CLB oropharynx dataset
The dataset contains 263 patients with newly diagnosed oropharyngeal SCC. All patients received neck dissection at the CLB and two collaborating hospitals. The dataset underlies a previous publication by Bauwens et al. [1] investigating differences in lymph node involvement between HPV-associated oropharyngeal SCC and HPV-negative tumors. Details on how the data was collected are described in the original research article.

Limitations
While all three provided tables and the rows representing individual patients conform to the same format, they do not all contain the same amount of detail. E.g., in some patients clinical involvement was not separately reported per imaging modality (MRI, CT, …) but as a consensus decision. As another example, not all patients were staged according to the same TNM edition (the used edition is always stated). Also, there are missing values for some patients and some columns. For example, sometimes several LNLs were resected en-bloc during neck dissection, and it was thus not possible to infer which LNL surely must have harbored metastases. The lateralization of the primary tumor, too, is not reported for parts of the cohort. The binary encoding of the reported data itself represents a limitation, as it discards information on the amount of disease and its precise location within an LNL. And lastly, the data may not be representative of the typical distribution of HNSCC patients, both in terms of their primary tumor location and their disease advancement. This is because all patients in these three cohorts have been treated with some form of neck dissection, which is not necessarily performed in all HNSCC patients.
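The information loss of a binary encoding can be made concrete with a small sketch that reduces per-level node counts to an involved/healthy flag. The function name, the dictionary layout, and the sample counts are hypothetical; treating a level without investigated nodes as unknown is an assumption.

```python
def binarize_involvement(positive, dissected):
    """Reduce node counts per LNL to True (involved), False (healthy), or None.

    `dissected` maps LNL -> number of investigated nodes (None if not resected);
    `positive` maps LNL -> number of metastatic nodes.
    """
    involvement = {}
    for lnl, n_dissected in dissected.items():
        n_positive = positive.get(lnl)
        if not n_dissected or n_positive is None:
            involvement[lnl] = None  # level not examined: status unknown
        else:
            involvement[lnl] = n_positive > 0
    return involvement


# Hypothetical counts for one side of the neck.
dissected = {"I": None, "II": 12, "III": 7, "IV": 5}
positive = {"I": None, "II": 3, "III": 0, "IV": 1}
involvement = binarize_involvement(positive, dissected)
print(involvement)
```

In this example, levels II and IV both map to True although they contain three and one metastatic nodes, respectively, which is exactly the information a binary encoding discards.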

Fig. 1. Stacked bar plot reporting the percentage of patients with ipsilateral LNL III involvement found in the three datasets, for oropharyngeal SCC (top panel) and oral cavity SCC (bottom panel). Patients are then stratified by whether LNL II was involved, and by the tumor subsite in the oral cavity case. For each subgroup of patients, the distribution over T-category is shown color-coded. For this plot, all three datasets have been combined. Involvement refers to pathological involvement for most patients (and clinical involvement if levels II or III were part of en bloc resections, such that pathologically positive lymph nodes could not be uniquely assigned to a level).

Fig. 2. Distributions over T-category in the three datasets, visualized as pie charts. The displayed datasets are, from left to right, the 2023-ISB-multisite data, the 2023-CLB-multisite dataset, and the 2021-CLB-oropharynx cohort.

Fig. 3. Distribution over primary tumor subsite in the 2023 ISB multisite dataset. For each location, the ICD-O-3 codes that are grouped together are indicated in the figure. E.g., "base of tongue" refers to all patients with an ICD-O-3 code of "C01" or "C01.9".

Fig. 2 characterizes the three patient cohorts in terms of the distribution over T-category. Comparing the ISB and CLB Multisite datasets in terms of T-category shows that the CLB dataset is shifted towards more advanced T-categories. Compared to the ISB dataset, the CLB data contains fewer T1 tumors and more T4 tumors.

Fig. 4. Distribution over primary tumor subsite in the 2023 CLB multisite dataset. For each location, the ICD-O-3 codes that are grouped together are indicated in the figure. E.g., "base of tongue" refers to all patients with an ICD-O-3 code of "C01" or "C01.9".

Fig. 5. Distribution over primary tumor subsite in the 2021 CLB oropharynx dataset. For each location, the ICD-O-3 codes that are grouped together are indicated in the figure. E.g., "base of tongue" refers to all patients with an ICD-O-3 code of "C01" or "C01.9".
So, for example, list entry 1.1.7 refers to a column with the three-level header "patient | # | nicotine_abuse" and this column reports about the patient's smoking status:

1. patient : This top-level header contains general patient information.
   1. # : The second-level header for the patient columns is only a placeholder.
      1. id : The local study ID.
      2. institution : The institution where the patient was treated.
      3. sex : The biological sex of the patient.
      4. age : The age of the patient at the time of diagnosis.
      5. diagnose_date : The date of diagnosis.
      6. alcohol_abuse : Whether the patient was abusingly drinking alcohol at the time of diagnosis.
      7. nicotine_abuse : Whether the patient was considered a smoker. This is set to False when the patient had zero pack-years.
      8. hpv_status : The p16 status of the patient as a surrogate marker for HPV-associated tumors.
      9. neck_dissection : Whether the patient underwent a neck dissection. In this dataset, all patients underwent a neck dissection.
      10. tnm_edition : The edition of the TNM classification used.
      11. n_stage : The pN category of the patient (pathologically assessed).
      12. m_stage : The M category of the patient.
      13. extracapsular : Whether the patient had extracapsular spread in any LNL.
2. tumor : This top-level header contains general tumor information.
   1. 1 : This second-level header enumerates synchronous tumors. No patient in this cohort had synchronous tumors.
      1. location : The location of the tumor.
      2. subsite : The subsite of the tumor, specified by ICD-O-3 code.
      3. side : Whether the tumor occurred on the right or left side of the mid-sagittal plane.
      4. central : Whether the tumor was located centrally or not.
      5. extension : Whether the tumor extended over the mid-sagittal line.
      6. volume : The volume of the tumor in cm^3.

      Ib_to_III : Total number of dissected lymph nodes in the left LNLs Ib-III. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III, because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
   3. right : Number of dissected lymph nodes per LNL on the right side.
      1. <LNL> : Total number of dissected lymph nodes in the right LNL.
      2. Ib_to_III : Total number of dissected lymph nodes in the right LNLs Ib-III. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III, because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
8. positive_dissected : This top-level header contains information about the number of pathologically positive lymph nodes in each LNL.
   1. info : This second-level header contains general information about the findings of metastasis by the pathologist.
      1. date : Date of the neck dissection.
      2. all_lnls : The total number of investigated lymph nodes that were found to harbor metastatic disease across all LNLs. Because during some neck dissections multiple LNLs were resected and sent to the pathologist together, this entry may report more lymph nodes than the sum of the individual LNL entries.
      3. largest_node_mm : Size of the largest lymph node in the neck dissection in mm.
      4. largest_node_lnl : LNL where the largest pathological lymph node metastasis was found.
   2. left : Number of pathologically positive lymph nodes per LNL on the left side.
      1. <LNL> : Number of pathologically positive lymph nodes in the left LNL.
      2. Ib_to_III : Total number of dissected lymph nodes found to harbor metastases in the left LNLs Ib-III. Note that this is not just the sum of the dissected nodes in the LNLs Ib to III, because some levels were resected en-bloc. Those are included in this column but could not be resolved for the individual LNLs.
   3. right : Number of pathologically positive lymph nodes per LNL on the right side.
      1. <LNL> : Number of pathologically positive lymph nodes in the right LNL.
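A table with a three-row header, as described above, could be read with the Python standard library roughly as follows. This is a sketch under assumptions: the sample content is made up, the function name is hypothetical, and it is assumed that every header row repeats its label in each column (rather than leaving repeated cells blank).

```python
import csv
import io


def read_three_level_csv(handle):
    """Read a CSV whose first three rows form a three-level column header.

    Returns a list of dicts, one per patient, keyed by
    (top_level, second_level, third_level) tuples.
    """
    reader = csv.reader(handle)
    top, second, third = next(reader), next(reader), next(reader)
    keys = list(zip(top, second, third))
    return [dict(zip(keys, row)) for row in reader]


# Made-up miniature table using header names from the list above.
sample = io.StringIO(
    "patient,patient,tumor\n"
    "#,#,1\n"
    "id,nicotine_abuse,subsite\n"
    "P001,True,C01.9\n"
)
rows = read_three_level_csv(sample)
print(rows[0][("patient", "#", "nicotine_abuse")])
```

With pandas, the equivalent would be `pandas.read_csv(path, header=[0, 1, 2])`, which yields columns indexed by the same three-level tuples.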