[03 OCR] - 영수증 데이터 추가 방법

728x90

1. EAST 모델 기반 영수증 텍스트 영역 검출 프로젝트

프로젝트는 EAST 모델을 기반으로 영수증의 텍스트 영역을 더욱 정밀하게 검출하는 데 중점을 두고 있다. 목표는 텍스트 영역 검출의 precision과 recall을 최대한 높여, 영수증 내 텍스트 검출 성능을 크게 향상시키는 것이다. EAST 모델 자체의 설명은 후반부에서 다루기로 하고, 먼저 기존에 부족했던 데이터셋을 보완하기 위해 영수증 데이터를 추가하는 과정을 공유하겠다.

영수증 데이터셋이 제한적이었기에, 텍스트 검출의 정확도와 성능 향상에 어려움이 있었다. 이를 해결하기 위해 추가 데이터를 활용하여 train/val 세트를 확장하였고, 그 결과 텍스트 검출의 precision 와 recall 모두 긍정적인 성능 향상을 확인할 수 있었다. 데이터 양이 증가함에 따라, 텍스트 영역 검출에서 높은 성능을 달성할 수 있었다는 점이 중요한 성과로 남는다.

이 글에서는 데이터 추가 과정에서의 구체적인 방법론 등을 자세히 소개할 예정이다.

2. Hugging Face CORD 데이터 추가 및 변환 과정

1. CORD 데이터 다운로드

우선 Hugging Face 사이트에서 CORD 데이터셋 페이지로 이동한다. 여기서 Files and versions 탭에 들어가 data 폴더로 접속하면 필요한 데이터셋을 확인할 수 있다. 원하는 데이터 파일을 선택하여 다운로드할 수 있으며, 다운로드한 파일의 확장자는 .parquet 형식이다.

naver-clova-ix/cord-v2 · Datasets at Hugging Face

{"gt_parse": {"menu": [{"nm": "Nasi Campur Bali", "cnt": "1 x", "price": "75,000"}, {"nm": "Bbk Bengil Nasi", "cnt": "1 x", "price": "125,000"}, {"nm": "MilkShake Starwb", "cnt": "1 x", "price": "37,000"}, {"nm": "Ice Lemon Tea", "cnt": "1 x", "price": "24

huggingface.co

Parquet 형식은 Apache에서 개발한 컬럼 기반의 파일 저장 형식으로, 주로 대규모 데이터 저장과 분석에 최적화되어 있다. 컬럼 단위로 데이터를 저장하기 때문에, 특정 컬럼에만 접근해도 되어 효율적이고 빠른 데이터 읽기가 가능하다. 이는 특히 빅데이터 분석 작업에서 유리하며, 저장 공간도 절약할 수 있다. 또한 다양한 데이터 타입을 지원하고, 압축 기능이 포함되어 있어 대량의 데이터를 저장할 때 유용하다.

2. UFO 형식 변환 작업

CORD 데이터셋은 .parquet 형식으로 되어 있으므로, 이를 UFO 형식으로 변환하는 작업이 필요하다. UFO 형식은 기존에 작업 중이던 프로젝트와 호환되기 때문에, 데이터 전처리를 통해 일관된 형식으로 맞추는 것이 중요하다. 변환 과정에서는 .parquet 파일을 파싱하여 UFO 형식에 맞게 변환하는 코드는 아래와 같다.

1) Parquet to JSON

우선 parquet 파일 형식을 json 파일로 변환이 필요하다.

from PIL import Image
from io import BytesIO
import json
import pandas as pd
import os


# `parquet` file
parquet_files = [
    './data/origin/train-00000-of-00004-b4aaeceff1d90ecb.parquet',
    './data/origin/train-00001-of-00004-7dbbe248962764c5.parquet',
    './data/origin/train-00002-of-00004-688fe1305a55e5cc.parquet',
    './data/origin/train-00003-of-00004-2d0cd200555ed7fd.parquet',
]

# dir json save
image_dir = './data/convert_data/cord_images'
json_dir = './data/convert_data/cord_json'
os.makedirs(image_dir, exist_ok=True)
os.makedirs(json_dir, exist_ok=True)

# index
image_index = 1

for num, file_path in enumerate(parquet_files):
    df = pd.read_parquet(file_path)

    for index, row in df.iterrows():
        # image save
        image_data = row['image']['bytes']
        image = Image.open(BytesIO(image_data))
        image_file_path = f'{image_dir}/image_{image_index}.jpg'
        image.save(image_file_path)

        # ground_truth save
        ground_truth_str = row['ground_truth']
        ground_truth_dict = json.loads(ground_truth_str)

        # JSON save
        with open(f'{json_dir}/image_{image_index}.json', 'w', encoding='utf-8') as json_file:
            json.dump(ground_truth_dict, json_file, ensure_ascii=False, indent=4)

        image_index += 1

print(f"Images saved in: {image_dir}")
print(f"JSON files saved in: {json_dir}")

2) JSON to COCO

UFO 형식으로 변환하기 전에 COCO 데이터 형식으로 변환이 필요하다.

COCO는 이미지의 객체 검출을 위한 주석 형식으로, 카테고리, 바운딩 박스, 세그먼트 등 다양한 정보가 포함된다.

import datetime
from typing import Dict
import json
import os

json_folder = f'./data/convert_data/cord_json/'
output_path = f'./data/convert_data/cord_json_convert_coco.json'

# 기존 정보
info = {
    'year': 2024,
    'version': '1.0',
    'description': 'OCR Competition Data',
    'contributor': 'Naver Boostcamp',
    'url': 'www',
}
licenses = {
    'id': '1',
    'name': 'For Naver Boostcamp Competition',
    'url': None
}
categories = [{'id': 1, 'name': 'word'}]

# COCO 데이터 초기화
img_id = 1
annotation_id = 1
images = []
annotations = []

file_names = os.listdir(json_folder)
sorted_file_names = sorted(file_names, key=lambda x: int(x.split('_')[1].split('.')[0]))  # 파일 정렬

for file_name in sorted_file_names:
    if file_name.endswith('.json') and img_id:
        input_path = os.path.join(json_folder, file_name)

        with open(input_path, 'r') as f:
            file = json.load(f)
        image = {
            'id': img_id,
            'width': file['meta']['image_size']['width'],
            'height': file['meta']['image_size']['height'],
            'file_name': f'image_{img_id}.jpg',
            "license": 1,
            "flickr_url": None,
            "coco_url": None,
            'data_captured': None
        }
        images.append(image)

        for ann_info in file['valid_line']:
            for word_info in ann_info['words']:
                quad_info = word_info['quad']
                x1 = quad_info['x1']
                y1 = quad_info['y1']
                x2 = quad_info['x2']
                y3 = quad_info['y3']

                # COCO 형식으로 bbox 좌표 계산
                min_x = x1
                min_y = y1
                width = x2 - x1
                height = y3 - y1

                segmentation = [
                    [min_x, min_y, min_x + width, min_y, min_x + width, min_y + height, min_x, min_y + height]
                ]

                coco_annotation = {
                    "id": annotation_id,
                    "image_id": img_id,
                    "category_id": 1,
                    "segmentation": segmentation,
                    "area": width * height,
                    "bbox": [min_x, min_y, width, height],
                    "iscrowd": 0,
                    'tags': ['Auto']
                }
                annotations.append(coco_annotation)
                annotation_id += 1

        img_id += 1

# make coco
coco = {
    'info': info,
    'images': images,
    'annotations': annotations,
    'licenses': licenses,
    'categories': categories
}

# JSON 파일로 저장
with open(output_path, 'w') as f:
    json.dump(coco, f, indent=4)

3) COCO to UFO

UFO 형식은 주로 OCR과 관련된 프로젝트에서 사용하는 주석 형식으로, 텍스트 영역과 그에 대응하는 정보가 포함된다. 따라서 COCO에서 텍스트 영역이나 바운딩 박스 정보를 UFO 형식에 맞게 변환하여 저장해야 한다.

import datetime
from typing import Dict
import json

now = datetime.datetime.now()
now = now.strftime('%Y-%m-%d %H:%M:%S')

# coco to ufo
input_path = f'./data/convert_data/cord_json_convert_coco.json'
output_path = f'./data/convert_data/cord_json_convert_ufo.json'

ufo = {
    'images': {}
}


def coco_bbox_to_ufo(bbox):
    min_x, min_y, width, height = bbox
    return [
        [min_x, min_y],
        [min_x + width, min_y],
        [min_x + width, min_y + height],
        [min_x, min_y + height]
    ]


def coco_to_ufo(file: Dict, output_path: str) -> None:
    anno_id = 1
    for annotation in file['annotations']:
        file_info = file['images'][int(annotation['image_id'])-1]
        image_name = file_info['file_name']
        if image_name not in ufo['images']:
            anno_id = 1
            ufo['images'][image_name] = {
                "paragraphs": {},
                "words": {},
                "chars": {},
                "img_w": file_info["width"],
                "img_h": file_info["height"],
                "tags": ["autoannotated"],
                "relations": {},
                "annotation_log": {
                    "worker": "",
                    "timestamp": now,
                    "tool_version": "LabelMe or CVAT",
                    "source": None
                },
                "license_tag": {
                    "usability": True,
                    "public": False,
                    "commercial": True,
                    "type": None,
                    "holder": "Upstage"
                }
            }

        ufo['images'][image_name]['words'][str(anno_id).zfill(4)] = {
            "transcription": "",
            "points":  coco_bbox_to_ufo(annotation["bbox"]),
            "orientation": "Horizontal",
            "language": None,
            "tags": ['Auto'],
            "confidence": None,
            "illegibility": False
        }
        anno_id += 1

    with open(output_path, "w") as f:
        json.dump(ufo, f, indent=4)


with open(input_path, 'r') as f:
    file = json.load(f)
coco_to_ufo(file, output_path)

3. 결론

기존 400장의 영수증 데이터셋에 Naver Clova Cord2 영수증 데이터 800장을 추가하여 총 1,200장의 데이터셋을 구성하였다. 추가된 데이터셋은 별도의 리라벨링 없이 train/val 세트로 나누어 학습을 진행하였다. 그 결과, F1 스코어가 기존 0.69에서 0.85로 상승하며 성능이 향상되는 것을 확인할 수 있었다. 이는 데이터 양이 부족해 발생했던 학습 한계를 해결했으며, 충분한 데이터 양이 모델 성능 향상에 중요한 요소임을 확인하였다.

이상입니다. 끝.

감사합니다.

728x90

저작자표시 비영리 변경금지

'딥러닝 (Deep Learning) > [08] - 프로젝트' 카테고리의 다른 글

[04 Segmentation] - PyTorch 에서 메모리 부족할때 해결하는 방법! (Autocast, GradScaler) (0)	2024.11.21
[04 Segmentation] - MMsegmentation 사용법 (11)	2024.11.09
[02 Object Detection] - MMdetection train/val 학습 방법 (6)	2024.10.27
[02 Object Detection] - MMdetection 설치 및 기본사용법 (6)	2024.10.26
[01 ImageNet Sketch] - Augmentation (2)	2024.09.26

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

[03 OCR] - 영수증 데이터 추가 방법

1. EAST 모델 기반 영수증 텍스트 영역 검출 프로젝트

2. Hugging Face CORD 데이터 추가 및 변환 과정

1. CORD 데이터 다운로드

2. UFO 형식 변환 작업

3. 결론

'딥러닝 (Deep Learning) > [08] - 프로젝트' 카테고리의 다른 글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역