[Python] Counting occurrences of a specific word in a file

This post walks through three ways to count how many times a specific word appears in a text file.

1. Using read()
2. Using readlines()
3. Using a regular expression

1. Using read()

The sample.txt file used in the examples contains the following.

Hello, World, Python!
안녕하세요, Hello
Example, Hello, Sample

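file.read() reads the entire text of the file and returns it as a single string. You can then call the string's count() method to count occurrences of the target. Note that count() matches substrings, so it is case-sensitive and also counts the target when it appears inside a longer word.

file.read() : returns the whole content of the file as a string
string.count(word) : returns the number of non-overlapping occurrences of word in string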

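file_path = 'sample.txt'
target_word = 'Hello'

# sample.txt contains Korean text, so the encoding is given explicitly
# (this assumes the file was saved as UTF-8).
with open(file_path, 'r', encoding='utf-8') as file:
    # Read the whole file into one string.
    file_contents = file.read()

# Count substring occurrences of the target word.
word_count = file_contents.count(target_word)

print(f"Count of '{target_word}': {word_count}")

Output:

Count of 'Hello': 3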
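Because count() works on substrings rather than whole words, it can over-count on other inputs. The following is a minimal sketch of that behavior; the text is a hypothetical string chosen for illustration, not the content of sample.txt.

```python
# Hypothetical text (not sample.txt) showing that str.count matches substrings.
text = "Hello, HelloWorld, hello"
target_word = 'Hello'

# Counts the standalone 'Hello' and the 'Hello' inside 'HelloWorld'.
# The lowercase 'hello' is skipped because the comparison is case-sensitive.
substring_count = text.count(target_word)

print(substring_count)  # 2
```

If only whole words should be counted, a word-boundary pattern with a regular expression is usually a better fit.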

2. Using readlines()

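readlines() reads the file line by line and returns the lines as a list of strings. You can iterate over that list, count the target in each line, and add up the results.

lines = file.readlines() : returns the text of the file as a list of lines
line.count(target_word) : returns the number of occurrences of target_word in one line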

file_path = 'sample.txt'
target_word = 'Hello'

# Assumes the file was saved as UTF-8.
with open(file_path, 'r', encoding='utf-8') as file:
    # Read all lines into a list.
    lines = file.readlines()

# Add up the per-line counts.
word_count = 0
for line in lines:
    word_count += line.count(target_word)

print(f"Count of '{target_word}': {word_count}")

Output:

Count of 'Hello': 3
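readlines() loads every line into memory at once. For large files, one possible variation is to iterate over the file object directly, which yields one line at a time. The sketch below is an illustration under assumptions: count_in_file is a hypothetical helper name, and the sample file is recreated in a temporary directory only so the snippet runs on its own.

```python
import os
import tempfile

def count_in_file(file_path, target_word, encoding='utf-8'):
    """Count substring occurrences of target_word, reading one line at a time."""
    with open(file_path, 'r', encoding=encoding) as file:
        # Iterating over the file object yields lines lazily.
        return sum(line.count(target_word) for line in file)

# Recreate the sample content in a temporary directory (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write("Hello, World, Python!\n안녕하세요, Hello\nExample, Hello, Sample\n")
    lazy_count = count_in_file(sample_path, 'Hello')

print(lazy_count)  # 3
```

The result matches the readlines() version; only the way the lines are read differs.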

3. Using a regular expression

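You can also read the whole file into a string with read() and count matches of a regular expression in that string. Unlike count(), this approach can restrict matches to whole words.

\b : matches a word boundary, that is, the position between a word character and a non-word character (whitespace, punctuation, or the start or end of the string)
re.escape(str) : escapes regex special characters in str so that it is matched as literal text rather than interpreted as a pattern
r'\b' + re.escape(target_word) + r'\b' : a pattern that matches target_word only when it stands as a whole word
re.IGNORECASE : makes the match case-insensitive, so 'hello' and 'HELLO' are counted as well
pattern.findall(file_contents) : returns a list of every match of the pattern in the string; len() of that list is the count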

import re

file_path = 'sample.txt'
target_word = 'Hello'

# Assumes the file was saved as UTF-8.
with open(file_path, 'r', encoding='utf-8') as file:
    file_contents = file.read()

# Match target_word as a whole word, ignoring case.
pattern = re.compile(r'\b' + re.escape(target_word) + r'\b', re.IGNORECASE)
word_count = len(pattern.findall(file_contents))

print(f"Count of '{target_word}': {word_count}")

Output:

Count of 'Hello': 3

For reference, printing pattern.findall(file_contents) on its own gives the following list.

['Hello', 'Hello', 'Hello']
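To see what \b, re.IGNORECASE, and re.escape each contribute, here is a small sketch on an in-memory string. The text and the helper name count_whole_word are assumptions made for illustration and are not part of the examples above.

```python
import re

def count_whole_word(text, target_word, ignore_case=True):
    """Count whole-word matches of target_word in text."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(r'\b' + re.escape(target_word) + r'\b', flags)
    return len(pattern.findall(text))

# Hypothetical text: 'HelloWorld' must not count, 'hello' counts only when ignoring case.
text = "Hello, HelloWorld, hello! a.b axb"

print(count_whole_word(text, 'Hello'))                     # 2
print(count_whole_word(text, 'Hello', ignore_case=False))  # 1

# With re.escape, the dot in 'a.b' is literal, so 'axb' does not match.
print(count_whole_word(text, 'a.b'))                       # 1

# Without escaping, '.' matches any character, so 'axb' matches too.
print(len(re.findall(r'\ba.b\b', text)))                   # 2
```

One caveat: \b depends on word characters, so a target that starts or ends with punctuation may not match the way you expect with this pattern.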