[JS] A Collection of Utility Functions for English Contractions

Background

In English, a contraction and an abbreviation are different concepts.

Contraction: two words shortened into one, with the omitted letters replaced by an apostrophe
e.g. I will → I'll, do not → don't
Abbreviation: a word or phrase shortened to some of its letters, usually without an apostrophe
e.g. United States → U.S., Doctor → Dr.

Regexes that keep contractions out of the split criteria

Splitting on non-word characters

// Matches any single non-word character
const NonWordCharPattern = /(\W)/g;
const sentence = "I'll make coffee and I've done my homework.";

// Add a space on both sides of each non-word character, e.g. '.' → ' . '
const replaced = sentence.replace(NonWordCharPattern, ' $1 ');
// "I ' ll   make   coffee   and   I ' ve   done   my   homework . "

replaced.split(/\s+/);
// ['I', "'", 'll', 'make', 'coffee', 'and', 'I', "'", 've', 'done', 'my', 'homework', '.', '']

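The \W metacharacter matches any character that is not a word character (0-9a-zA-Z_), including whitespace
$1 refers to the first capture group (the parentheses). In the example below it is referenced once for the comma and once for the space
'Hello, World'.replace(/(\W)/g, ' $1 ') → 'Hello ,    World'
- First $1 value: , → the comma becomes ' , '
- Second $1 value: a single space → the space becomes three spaces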
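
To see what $1 receives on each match, the same replacement can be written with a callback that records the captured group. This is a minimal sketch; the names input, captured, and spaced are illustrative and do not come from the original code.

```typescript
const input = 'Hello, World';
const captured: string[] = [];

// Equivalent to input.replace(/(\W)/g, ' $1 '), but records each captured character.
const spaced = input.replace(/(\W)/g, (_match: string, group1: string) => {
  captured.push(group1);
  return ` ${group1} `;
});

// captured: [',', ' ']
// spaced:   'Hello ,    World' (the comma becomes ' , ' and the space becomes '   ')
```

Splitting spaced on /\s+/ then collapses each run of spaces, which is why the extra padding is harmless.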

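Splitting on characters that are neither word characters nor apostrophes (')

// Matches any single character that is neither a word character nor an apostrophe (')
const NonWordCharPattern = /([^\w'])/g;

// Add a space on both sides of each matched character, e.g. ' ' → '   '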
const replaced = "I'll make coffee and I've done my homework.".replace(NonWordCharPattern, " $1 ");
// "I'll   make   coffee   and   I've   done   my   homework . "

replaced.split(/\s+/);
// ["I'll", 'make', 'coffee', 'and', "I've", 'done', 'my', 'homework', '.', '']

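/([^\w'])/g : matches any character that is neither a word character nor an apostrophe (') (spaces, commas, and so on)
- [] : character class. Matches if any one of the characters inside the brackets matches
- [^] : negated character class. When the brackets start with a caret (^), only characters not listed inside match
split(/\s+/) : splits on runs of one or more whitespace characters
- \s : a whitespace character
- + : one or more repetitions
- 'Hello , World'.split(/\s+/) → ['Hello', ',', 'World']
  1. After splitting at the first space: ['Hello', ', World']
  2. After splitting at the second space: ['Hello', ',', 'World']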
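
The replace and split steps can be wrapped in a single helper. The sketch below is one possible arrangement rather than the original post's code: tokenize and SEPARATOR_PATTERN are hypothetical names, and the final filter is an added step that drops the empty string left behind by the trailing space.

```typescript
// Matches any character that is neither a word character nor an apostrophe.
const SEPARATOR_PATTERN = /([^\w'])/g;

// Hypothetical helper: pads separators with spaces, splits on whitespace,
// and removes empty strings produced by leading or trailing spaces.
const tokenize = (sentence: string): string[] =>
  sentence
    .replace(SEPARATOR_PATTERN, ' $1 ')
    .split(/\s+/)
    .filter((token) => token.length > 0);

const tokens = tokenize("I'll make coffee and I've done my homework.");
// tokens: ["I'll", 'make', 'coffee', 'and', "I've", 'done', 'my', 'homework', '.']
```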

findContrIndexes: returning the indexes of contracted words in a sentence

const ContractionPattern = /\b\w+'\w*\b/;

const findContrIndexes = (arr: string[]) => {
  return arr.reduce((acc: number[], cur, i) => {
    return ContractionPattern.test(cur) ? acc.concat(i) : acc;
  }, []);
};

findContrIndexes([
  "I'll",
  'make',
  'coffee',
  'and',
  "I've",
  'done',
  'my',
  'homework',
  '.',
]);
// returns [0, 4]

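The regex /\b\w+'\w*\b/ is used to identify a range of contractions such as I've and can't. \w+' is the part up to and including the apostrophe ('), and \w* is the part after it.

\b : word boundary (a position with a word character on one side and no word character on the other)
\w+' : one or more consecutive word characters followed by an apostrophe
\w* : zero or more consecutive word characters. Because of the trailing \b, at least one word character must follow the apostrophe in practice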
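
A quick way to check what the pattern accepts is to run it over a few sample words. The sketch below uses illustrative sample words that are not from the original post; note that the pattern cannot tell a possessive such as John's apart from a contraction.

```typescript
const CONTRACTION_PATTERN = /\b\w+'\w*\b/;

const samples = ["I've", "can't", "John's", "dogs'", "'tis", 'coffee'];
const results = samples.map((word) => CONTRACTION_PATTERN.test(word));
// results: [true, true, true, false, false, false]
// "dogs'" fails because the final \b needs a word character after the apostrophe.
// "'tis" fails because \w+ needs at least one word character before the apostrophe.
```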

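generateContrMap: returning contraction info for each word in a sentence

// Assumes the `uuid` npm package is what provides uuidv4
import { v4 as uuidv4 } from 'uuid';

export type ExpandedToken = {
  id: string;
  token: string;
};

export type Contraction = {
  originalToken: string; // original word
  isContr: boolean; // whether the word is a contraction
  expandedTokens: ExpandedToken[]; // array holding the expanded form of the contraction
  autoExpand: boolean; // whether the contraction is expanded automatically
};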

const ContractionPattern = /\b\w+'\w*\b/;

const makeExpandedContrToken = (token = '') => ({
  id: uuidv4(),
  token,
});

const makeExpandedContrTokens = (count = 2, token = '') => {
  return Array.from({ length: count }, () => makeExpandedContrToken(token));
};

const generateContrMap = (tokens: string[]): Contraction[] => {
  return tokens.map((word) => {
    const isContr = ContractionPattern.test(word);
    return {
      originalToken: word,
      isContr,
      expandedTokens: isContr ? makeExpandedContrTokens() : [],
      autoExpand: false,
    } satisfies Contraction;
  });
};

generateContrMap([
  "I'll",
  'make',
  'coffee',
  'and',
  "I've",
  'done',
  'my',
  'homework',
  '.',
]);

// Return value of generateContrMap (Contraction[])
[
  {
    originalToken: "I'll",
    isContr: true,
    expandedTokens: [{ id: '...', token: '' }, { id: '...', token: '' }],
    autoExpand: false,
  },
  {
    originalToken: 'make',
    isContr: false,
    expandedTokens: [], // always an empty array when isContr is false
    autoExpand: false,
  },
  // ...
];

When isContr is true, expandedTokens is assigned an array of placeholder tokens for the expanded form
When isContr is false, expandedTokens is always an empty array
expandedTokens[n].token defaults to the empty string ''
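
Since every token starts out blank, something has to fill the placeholders in later. The sketch below shows one way that could look. It is an assumption, not part of the original code: fillExpandedTokens is a hypothetical helper, and the ids 'a' and 'b' are stand-ins for the values uuidv4() would generate.

```typescript
type ExpandedToken = { id: string; token: string };
type Contraction = {
  originalToken: string;
  isContr: boolean;
  expandedTokens: ExpandedToken[];
  autoExpand: boolean;
};

// Hypothetical helper: returns a new map in which the contraction at `index`
// has its placeholder tokens filled with `words`, keeping the existing ids.
const fillExpandedTokens = (
  contrMap: Contraction[],
  index: number,
  words: string[],
): Contraction[] =>
  contrMap.map((item, i) =>
    i === index && item.isContr
      ? {
          ...item,
          expandedTokens: item.expandedTokens.map((expanded, j) => ({
            ...expanded,
            token: words[j] ?? expanded.token,
          })),
        }
      : item,
  );

const contrMap: Contraction[] = [
  {
    originalToken: "I'll",
    isContr: true,
    // Placeholder ids; the original code generates them with uuidv4().
    expandedTokens: [{ id: 'a', token: '' }, { id: 'b', token: '' }],
    autoExpand: false,
  },
  { originalToken: 'make', isContr: false, expandedTokens: [], autoExpand: false },
];

const filled = fillExpandedTokens(contrMap, 0, ['I', 'will']);
// filled[0].expandedTokens: [{ id: 'a', token: 'I' }, { id: 'b', token: 'will' }]
```

Returning a new array instead of mutating the input keeps the helper usable with state containers that rely on reference changes.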

getTokensWithExpandedContr: returning the tokens with contractions expanded

const getTokensWithExpandedContr = (contrMap: Contraction[]) => {
  return contrMap.reduce((result: string[], item: Contraction) => {
    if (item.isContr) {
      const expandedTokens = item.expandedTokens
        .map(({ token }) => token.trim())
        .filter((token) => token.length > 0);

      if (expandedTokens.length) return result.concat(expandedTokens);
    }

    return result.concat(item.originalToken);
  }, []);
};

// Before expanding contractions
["I'll", 'make', 'coffee', 'and', "I've", 'done', 'my', 'homework', '.']

// After expanding contractions (return value of getTokensWithExpandedContr once expandedTokens hold 'I', 'will' and 'I', 'have')
['I', 'will', 'make', 'coffee', 'and', 'I', 'have', 'done', 'my', 'homework', '.']
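
The function falls back to the original token whenever a contraction has no usable expanded tokens. The sketch below illustrates that behavior under stated assumptions: entry is a hypothetical helper for building test data, and its ids are placeholders rather than uuidv4() values.

```typescript
type ExpandedToken = { id: string; token: string };
type Contraction = {
  originalToken: string;
  isContr: boolean;
  expandedTokens: ExpandedToken[];
  autoExpand: boolean;
};

const getTokensWithExpandedContr = (contrMap: Contraction[]): string[] =>
  contrMap.reduce<string[]>((result, item) => {
    if (item.isContr) {
      const expandedTokens = item.expandedTokens
        .map(({ token }) => token.trim())
        .filter((token) => token.length > 0);
      if (expandedTokens.length) return result.concat(expandedTokens);
    }
    return result.concat(item.originalToken);
  }, []);

// Hypothetical helper: a word counts as a contraction here when expansions are supplied.
const entry = (originalToken: string, expansions: string[] = []): Contraction => ({
  originalToken,
  isContr: expansions.length > 0,
  expandedTokens: expansions.map((token, i) => ({ id: `${originalToken}-${i}`, token })),
  autoExpand: false,
});

const expandedWords = getTokensWithExpandedContr([
  entry("I'll", ['I', 'will']), // filled in: replaced by its expansion
  entry('make'), // not a contraction: kept as-is
  entry("I've", ['', ' ']), // blank placeholders: original token is kept
]);
// expandedWords: ['I', 'will', 'make', "I've"]
```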

expandContractions: expanding contracted words

const CONTRACTIONS: Record<string, string> = {
  "'ll": ' will',
  "'ve": ' have',
  "'re": ' are',
  "'d": ' would',
  "'m": ' am',
  "'s": ' is',
  "can't": 'cannot',
  "couldn't": 'could not',
  "shouldn't": 'should not',
  "won't": 'will not',
  "wouldn't": 'would not',
  "doesn't": 'does not',
  "don't": 'do not',
  "didn't": 'did not',
  "n't": ' not',
  "ain't": 'am not', // or 'is not', 'are not', 'has not', 'have not' based on context
  "aren't": 'are not',
  "wasn't": 'was not',
  "weren't": 'were not',
  "hasn't": 'has not',
  "haven't": 'have not',
  "isn't": 'is not',
  "it's": 'it is', // or 'it has' based on context
  "i'm": 'I am',
  "i've": 'I have',
  "i'd": 'I would', // or 'I had' based on context
  "i'll": 'I will',
  "you're": 'you are',
  "you've": 'you have',
  "you'd": 'you would', // or 'you had' based on context
  "you'll": 'you will',
  "let's": 'let us',
  "he's": 'he is', // or 'he has' based on context
  "she's": 'she is', // or 'she has' based on context
  "they're": 'they are',
  "they've": 'they have',
  "they'd": 'they would', // or 'they had' based on context
  "they'll": 'they will',
};

const ContractionsPattern = new RegExp(
  Object.keys(CONTRACTIONS).join('|'),
  'g',
);

/* Value of ContractionsPattern
/'ll|'ve|'re|'d|'m|'s|can't|couldn't|shouldn't|won't|wouldn't|.../g
*/

const expandContractions = (sentence: string, separator = ' ') => {
  return sentence
    .replace(ContractionsPattern, (match) => CONTRACTIONS[match])
    .split(separator);
};

expandContractions("I'll make coffee and I've done my homework.");
// ['I', 'will', 'make', 'coffee', 'and', 'I', 'have', 'done', 'my', 'homework.']

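The replacement function passed as the second argument to replace is called each time the pattern finds a match
In the example above it is called twice, receiving "'ll" and then "'ve" as the match parameter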
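
The pattern above is case-sensitive, so a capitalized "Can't" is only matched by the "n't" entry and comes out as "Ca not". One possible variant, sketched here as an assumption rather than the original implementation, adds the i flag, looks up the lowercased match, and restores a leading capital. SAMPLE_CONTRACTIONS and expandIgnoringCase are hypothetical names, and the table is a reduced copy for illustration.

```typescript
// A reduced table for illustration; the full CONTRACTIONS map works the same way.
const SAMPLE_CONTRACTIONS: Record<string, string> = {
  "'ll": ' will',
  "'ve": ' have',
  "n't": ' not',
  "can't": 'cannot',
  "won't": 'will not',
};

const CaseInsensitivePattern = new RegExp(
  Object.keys(SAMPLE_CONTRACTIONS).join('|'),
  'gi',
);

const expandIgnoringCase = (sentence: string): string =>
  sentence.replace(CaseInsensitivePattern, (match: string) => {
    const expanded = SAMPLE_CONTRACTIONS[match.toLowerCase()];
    if (expanded === undefined) return match;
    // Restore a leading capital, e.g. "Can't" -> "Cannot".
    const first = match.charAt(0);
    return first !== first.toLowerCase()
      ? expanded.charAt(0).toUpperCase() + expanded.slice(1)
      : expanded;
  });

const expandedSentence = expandIgnoringCase("Can't stop, won't stop. I'll go.");
// expandedSentence: 'Cannot stop, will not stop. I will go.'
```

This variant still cannot resolve context-dependent cases such as 's (is, has, or a possessive), which the comments in the CONTRACTIONS table already point out.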

