Indonesian Air Force Physical Tester Reliability in Assessing One-Minute Push-Up, Pull-Up, and Sit-Up Tests

The physical fitness test is a form of assessment used to determine a person's level of physical fitness, both general and specific (muscular). The purpose of this study was to assess the correlation among testers on one-minute pull-up, sit-up, and push-up tests, and to determine which of the three tests has the lowest reliability. The study used a sample of five physical fitness testers of the Indonesian Air Force (TNI AU) who are experienced and active in conducting tests. The subjects were 25 males 18–22 years old. All testers assessed each subject by recording the number of repetitions on all three tests. The data obtained were then converted based on the Indonesian Air Force physical fitness technical guidelines book. Analysis with ANOVA and the intraclass correlation coefficient (ICC) showed that the ICC values produced by the five testers varied across the tests and were lowest on the push-up test. The reliability of the testers can be increased through practice, tester selection, and attention to the ability of the tester. The development of assessment tools and of alternative forms of testing is also needed.


Introduction
Physical fitness has an essential role in supporting one's physical activities so that duties can be carried out optimally. The degree of physical fitness has a linear relationship with the level of achievement, work success, and other physical activities (Widiyanto & Hartono, 2018). Many institutions require a certain level of fitness, so systems and tools that can measure and assess someone's fitness level are needed. Harsono (2015) argues that the physical fitness components that can be measured and assessed include strength, endurance, muscular power, speed, flexibility, agility, coordination, balance, accuracy, and reaction. In measuring physical fitness, the aspects that must be measured are the basic motor skills, which include strength, endurance, speed, flexibility, and coordination (Bompa & Haff, 2009).
Muscle strength and endurance are essential components of physical fitness (McManis, Baumgartner, & Wuest, 2000). The level of strength and endurance of muscles affects the ability of individuals to perform daily functions and various physical activities. A physical fitness test is needed to produce data about physical abilities, both in monitoring the physical development of coaching and in the context of selection. In the Indonesian Air Force, one-minute model pull-ups, push-ups, and sit-ups are part of a form of physical fitness test conducted to determine the strength and endurance of muscles without using assistive devices (Hartono, Widodo, Wismanadi, & Hikmatyar, 2019). Pull-up and push-up tests are used to assess and develop the strength of the shoulders, arms, and upper body, while sit-up tests are used to measure the strength and endurance of the abdominal muscles (Fox, 1988; Piscopo & Baley, 1981; TNI AU, 2011; TNI AU, 2013).
The results of these three tests depend on the tester, who judges and counts the pull-ups, sit-ups, and push-ups. Based on the technical guidance of the Indonesian Air Force soldiers' physical safety test, the pull-up movement is done by lifting the body with the strength of the arms so that the chin passes above the bar, then lowering to the starting posture before lifting the body again; this is repeated as many times as possible without resting for a maximum of one minute.
In the sit-up test, in the initial stance, participants lie on their backs with their legs bent 90 degrees, their feet flat against the floor and knees approximately 20 cm apart, hands placed behind the head, fingers placed with legs held in place to keep them from moving. The movement starts with rising to a sitting position and bending forward until the nose touches the right or left knee and one of the elbows is between the knees; the subject then quickly goes down, lying on his back as in the starting posture, and repeats the motion for a maximum of one minute.
In the push-up test, the starting position is with both hands under the shoulders, arms bent at the side of the body, legs straight with toes resting on the floor, and the distance between the hands as wide as the body. The subject straightens his arms to lift the body so that it is raised with the legs and body straight; he then bends his arms so that the body lowers until his chest touches the floor while his stomach does not, with the head turned to the right or left. The movement is repeated for one minute.
According to Miller (2002), a physical ability test should be valid, reliable, objective, economical, attractive, and practical to implement. The study of McManis et al. (2000) reveals problems that are often encountered in the pull-up test: when test participants are in the down position, it is difficult to make an assessment. Likewise, in the push-up test, many assessors have difficulty judging the movements, so the measurement results differ between assessors. Barnett et al. (2009) showed that some motor skills are problematic to assess, which results in low reliability scores; they highlighted obstacles in determining reliability values in field-based research with direct observation, as opposed to more controlled research that uses assistive devices.
In contrast, Baumgartner and Gaunt (2005), in their research on push-up movements, stated that the problem in push-up tests was to determine the position from which the tester could assess accurately. The tester must decide whether each movement of the test-taker is a correct movement that earns a score. The position of each part of the body determines the movements performed, including the number of times the movement can be repeated. This is consistent with Cogley et al. (2005), whose research on free movement found that different hand positions in push-ups affect push-up results.
It is crucial that the tester can carry out measurements that produce accurate test data. In tests, measurement errors are difficult to avoid, so what the tester can do is keep the error as small as possible. A mass test with a large number of participants requires many testers, so the measurement and assessment data may not agree between testers. Tests involving a large number of testers must pay attention to the agreement between the testers (Putranta & Supahar, 2019). Putranta and Supahar's research (2019) shows that when the total scores resulting from inter-assessor measurements and the assessors' agreement are examined, the scores are almost never identical. Kozlowski and Hattrup (1992) define agreement as interrater consensus and reliability as interrater consistency. The ability of a tester to take measurements and make assessments consistently with other testers is called inter-rater reliability. There are many ways to obtain an inter-rater reliability coefficient such as the intraclass correlation coefficient (ICC), but the basic technique is based on analysis of variance and estimation of various components of variance (Bartko, 1966). The ICC approach is used to assess the consistency of measurements made by several testers on test-takers. Various indices that measure the agreement between several assessors regarding the presence or absence of different measurement results can be interpreted as an intraclass correlation coefficient (Rae, 1984). Fielitz, Coelho, Horne, and Brechue (2016) found that the coefficient among raters on a two-minute push-up test was small, which is in line with Mathews' (2013) research on rater reliability in pull-up and push-up tests, which states that the ability of the rater to measure a physical fitness index is not better when carried out alternately or simultaneously. The learning factor also helps in calculating a more valid score, so measurement needs to be preceded by training.
This study aims to determine the level of reliability among Indonesian Air Force physical testers on one-minute push-up, pull-up, and sit-up tests, and to determine which of the three tests has the lowest reliability. Furthermore, the results of this study are intended to be used as material for correction, training, and guidance in future testing.

Methods
Respondents in this study consisted of 25 young males, civilians and students aged 18-25 years, who were part of the physical fitness development group at Adi Sucipto Air Force Base, Yogyakarta, Indonesia. Five randomly selected testers came from the Air Force Physical Development unit; they were experienced and often involved in physical fitness testing. Each assessor was a male active military member aged 25-50 years who was still actively involved in physical fitness testing in the Indonesian Air Force.
All procedures for carrying out the pull-up, sit-up, and push-up tests followed the technical guidelines for physical fitness tests issued by the Indonesian Air Force Headquarters. Participants carried out the one-minute pull-up test one at a time in the order given by the assessor. During the test, each subject was rated by all five testers. The testers counted only the correct movements performed by the subject during the one minute. If the participant stopped before one minute had expired, the test was considered complete, and the tester recorded the result obtained. The same procedure was followed for sit-ups and push-ups, with the same subjects, but the subject was given sufficient rest time before carrying out the next test.
The raw data for each test were the number of correct movements each subject completed in one minute, as recorded by each tester. These counts were then converted to a score on a scale of 0-100 according to the assessment table contained in the manual for the physical fitness test of the Indonesian Air Force. The converted scores were then processed by ANOVA and ICC analysis with the SPSS software.
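As a hedged illustration of the ANOVA and ICC step, the following minimal Python sketch computes the between-tester F ratio and a consistency-type ICC from a subjects-by-testers score matrix. It is not the SPSS procedure used in this study. It assumes a two-way layout without replication and the standard consistency formulas for single and average measures; the function name `two_way_icc` and the toy scores are hypothetical and are not data from this study.

```python
from typing import Dict, List


def two_way_icc(scores: List[List[float]]) -> Dict[str, float]:
    """scores[i][j] is the converted score of subject i given by tester j."""
    n = len(scores)        # number of subjects (rows)
    k = len(scores[0])     # number of testers (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for subjects, testers, and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_error == 0:
        raise ValueError("residual variance is zero; F ratio is undefined")

    return {
        # F ratio for systematic differences between testers.
        "f_testers": ms_cols / ms_error,
        # Consistency ICC for a single tester (assumed formula).
        "icc_single": (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error),
        # Consistency ICC for the mean of k testers (assumed formula).
        "icc_average": (ms_rows - ms_error) / ms_rows,
    }


# Hypothetical toy data: 3 subjects rated by 2 testers.
toy_scores = [[1.0, 2.0], [3.0, 3.0], [5.0, 7.0]]
result = two_way_icc(toy_scores)
print(result)
```

Because the consistency form removes the tester main effect, a tester who scores every subject uniformly higher raises the F ratio but leaves the ICC unchanged, which is why the two analyses are reported side by side.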

Results
Descriptive data analysis showed that the five testers gave different ratings on the pull-up, sit-up, and push-up assessments. The mean and standard deviation of the scores given by each of the five testers are presented in Table 1. In the pull-up test, the fifth tester had the highest average rating, with a mean of 59.00±23.255. The lowest ratings, with a mean of 35.04±22.79, were obtained from the second tester. The sit-up assessment shows a similar pattern: the ratings of the five testers differ, with the highest mean of 86.20±11.99 obtained from the fifth tester and the lowest mean of 69.80±18.07 obtained from the second tester.
In the push-up test, the highest average rating was obtained from the fifth tester, with a mean of 53.56±22.74, while the lowest, with a mean of 32.64±25.96, was obtained from the fourth tester. From these data, it appears that the fifth tester tends to give high ratings compared to the other testers, and the second tester tends to give low ratings. The differences in the results from the five testers are also confirmed by the ANOVA analysis presented in Table 2. Table 2 shows that the assessments of the five testers differ significantly on all three types of tests, with the calculated F value greater than the critical F value and a significance value of p=0.0000. The F value was 17.407 for the pull-up test, 28.174 for the sit-up test, and 12.239 for the push-up test, all of which are greater than the critical F value of 2.87.
Regarding the level of reliability of the testers in noncritical assessments on the pull-up, sit-up, and push-up tests, the magnitude of the correlation among testers obtained through the intraclass correlation coefficient (ICC) analysis can be seen in Table 3. The correlations among testers in Table 3 were calculated using the consistency type of ICC, which emphasizes the similarity of ratings between testers. This type of approach is suitable for measuring abilities where the emphasis is on the differences between subjects and the achievement of predetermined criteria. Table 3 shows that the correlation coefficients among testers vary across the three types of tests: the pull-up test has an ICC coefficient of 0.782 with a correlation range of 0.657-0.882, the sit-up test has an ICC coefficient of 0.868 with a correlation range of 0.782-0.931, and the push-up test has an ICC coefficient of 0.706 with a correlation range of 0.556-0.835.

Discussion
Pull-up, sit-up, and push-up tests assess essential components of physical fitness, especially muscle strength and endurance. The equipment required is simple: a crossbar for the pull-up test, and none for the sit-up and push-up tests. Another consideration is that these tests can be used to test many participants within a limited period, as in military institutions with many test subjects. Muscle strength and endurance are important for people who engage in many physical activities, such as athletes and soldiers. According to D'Isanto et al. (2019), the assessments produced through tests serve to define the anthropometric and psychomotor profiles of a person, which are used to help determine the goals needed to set a training programme.
Accurately assessing the three tests is difficult because the participants aim to complete as many repetitions as possible in one minute. Such rapid repetition of movements makes it difficult for the tester to judge carefully and produce accurate data. The assessment results from the testers vary, even within the same test. Analysis of variance of the above results leads to the conclusion that the assessments made by the five testers differ significantly, with p=0.0000.
The inter-rater correlation coefficients also vary: the push-up test has the smallest value, with ICC = 0.687, while the ICC values of the pull-up and sit-up tests are > 0.8. This study also found that the range of the correlation coefficient values of the five testers in each test is fairly wide, so the reliability of the testers cannot yet be considered fully adequate. Koo and Li (2015) stated that an ICC value below 0.50 is poor, between 0.50 and 0.75 is moderate, between 0.75 and 0.90 is good, and above 0.90 is excellent.
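The interpretation bands attributed to Koo and Li above can be sketched as a small lookup function. This is only an illustration: the function name `classify_icc` is hypothetical, the category names "poor" and "moderate" are used for the lowest two bands, and the handling of the exact boundary values (0.50, 0.75, 0.90) is an assumption, since the text does not say which category a boundary value belongs to.

```python
def classify_icc(icc: float) -> str:
    """Map an ICC value to the interpretation bands quoted in the text."""
    if icc > 1.0:
        raise ValueError("an ICC cannot exceed 1.0")
    if icc < 0.50:
        return "poor"
    if icc < 0.75:       # assumption: 0.50 itself counts as moderate
        return "moderate"
    if icc <= 0.90:      # assumption: 0.75 and 0.90 count as good
        return "good"
    return "excellent"


# ICC coefficients as reported in the Results section.
reported = {"pull-up": 0.782, "sit-up": 0.868, "push-up": 0.706}
for test_name, icc in reported.items():
    print(f"{test_name}: ICC = {icc:.3f} -> {classify_icc(icc)}")
```

Applying the same function to the lower bound of each reported correlation range, rather than to the point estimate, gives a more conservative reading of tester reliability.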
Meanwhile, Artero, España-Romero, and Castro-Piñero (2011) suggested that an ICC between 0.70 and 0.80 is still questionable or doubtful, and 0.90 is considered high. Thus, the reliability between testers on the pull-up, sit-up, and push-up tests needs to be improved. Bajpai, Bajpai, and Chaturvedi (2015) state that it is essential to realize that perfect agreement between testers cannot be reached and that professional, experienced testers are needed to obtain high coefficient values between them. There are several concrete steps to improve the consistency of the testers' assessments and increase the ICC coefficients of multiple testers, namely the training of assessors, the selection of assessors, and attention to their ability to judge. Several studies have been carried out to improve the reliability of testers in carrying out physical tests.
McCunn et al. (2017) state that filming helps achieve better agreement among the judges. A similar study by Mischiati et al. (2015) shows that the reliability value between assessors is acceptable. Rogers et al. (2017) demonstrated that assessors who use video assistance produce better ICC scores than those who rely on direct measurement.
Further research needs to be done to develop tools that help testers assess pull-up, sit-up, and push-up tests, such as infrared motion sensors and cameras. Alternative forms of testing that measure muscle strength and power while remaining practical for tests with a large number of participants are also needed. Tester training, the selection of the testers involved in the test, and attention to the ability of the tester are ways of improving the reliability of the tester. It is necessary to develop assistive devices for pull-up, sit-up, and push-up tests to help the tester decide which movements are correct, especially for tests with a large number of subjects.
Assessments made by multiple testers on pull-up, sit-up, and push-up tests that rely on human measurement are prone to differences in measurement. This can be seen from the test results, which vary significantly between testers with a significance of p = 0.000. This study also shows differences in the level of tester reliability, expressed as the correlation coefficient among testers (ICC), for which the reliability of the testers on the exercises is in the medium range in all cases; therefore, for use in military institutions, efforts to improve reliability are necessary. In addition, other alternatives are needed, such as other forms of testing, assistive devices to facilitate measurement, and training for the testers.