๋ฐ˜์‘ํ˜•

Background


In English, a contraction and an abbreviation are different concepts.

 

  • Contraction: two words shortened into one by omitting some letters and putting an apostrophe in their place
    e.g. I will → I'll, do not → don't
  • Abbreviation: a shortened form that uses only some of the letters of a word or phrase. No apostrophe is used
    e.g. United States → U.S., Doctor → Dr.

 

์ถ•์•ฝ์€ ๋ถ„๋ฆฌ ๊ธฐ์ค€์—์„œ ์ œ์™ธํ•˜๋Š” ์ •๊ทœ์‹


Splitting on non-word characters

// Matches any single non-word character
const NonWordCharPattern = /(\W)/g;
const sentence = "I'll make coffee and I've done my homework.";

// Pad each non-word character with a space on both sides, e.g. '.' → ' . '
const replaced = sentence.replace(NonWordCharPattern, ' $1 ');
// "I ' ll   make   coffee   and   I ' ve   done   my   homework . "

replaced.split(/\s+/);
// ['I', "'", 'll', 'make', 'coffee', 'and', 'I', "'", 've', 'done', 'my', 'homework', '.', '']

 

  • The \W metacharacter matches any character that is not a word character (0-9a-zA-Z_), which includes whitespace
  • $1 refers to the first capture group (the parentheses). In the example below it refers once to the comma and once to the space
  • 'Hello, World'.replace(/(\W)/g, ' $1 ') → 'Hello ,    World'
    • First $1 value: ',' → ' , '
    • Second $1 value: 1 space → 3 spaces
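To see what $1 holds on each match, the replacement can be written as a function instead of a string. A minimal sketch: the `captured` array and the `group1` parameter name are introduced here only for illustration.

```typescript
// Collect every character captured by (\W) while padding it with spaces.
const captured: string[] = [];

const padded = 'Hello, World'.replace(/(\W)/g, (_match: string, group1: string) => {
  captured.push(group1); // group1 is the value that ' $1 ' would refer to
  return ` ${group1} `;
});

console.log(captured); // [',', ' ']
console.log(padded); // 'Hello ,    World'
```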

 

Splitting on characters that are neither word characters nor apostrophes (')

// Matches any single character that is neither a word character nor an apostrophe (')
const NonWordCharPattern = /([^\w'])/g;

// Pad each such character with a space on both sides, e.g. ' ' → '   '
const replaced = "I'll make coffee and I've done my homework.".replace(NonWordCharPattern, " $1 ");
// "I'll   make   coffee   and   I've   done   my   homework . "

replaced.split(/\s+/);
// ["I'll", 'make', 'coffee', 'and', "I've", 'done', 'my', 'homework', '.', '']

 

  • /([^\w'])/g : matches any character that is neither a word character nor an apostrophe (') (whitespace, commas, etc.)
    • [] : character class. Matches if the character is any one of those listed inside the brackets
    • [^] : negated character class. When the brackets start with a caret (^), only characters not listed inside them match
  • split(/\s+/) : splits on runs of one or more whitespace characters
    • \s : a whitespace character
    • + : one or more occurrences
    • 'Hello ,    World'.split(/\s+/) → ['Hello', ',', 'World']
      1. After splitting at the first run of whitespace: ['Hello', ',    World']
      2. After splitting at the second run of whitespace: ['Hello', ',', 'World']
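Both outputs above end with an empty string '' because the padded sentence ends in a space, while the token lists used in the later sections do not contain it. One way to bridge that gap is sketched below; `tokenize` is a hypothetical helper name, not something defined in the original code.

```typescript
// Hypothetical helper: pad separators, split on whitespace, drop empty strings.
const tokenize = (sentence: string): string[] =>
  sentence
    .replace(/([^\w'])/g, ' $1 ') // keep apostrophes attached to their word
    .split(/\s+/)
    .filter((token) => token.length > 0); // removes the trailing ''

const tokens = tokenize("I'll make coffee and I've done my homework.");
console.log(tokens);
// ["I'll", 'make', 'coffee', 'and', "I've", 'done', 'my', 'homework', '.']
```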

 

findContrIndexes: returning the indexes of the contracted words in a sentence


const ContractionPattern = /\b\w+'\w*\b/;

const findContrIndexes = (arr: string[]) => {
  return arr.reduce((acc: number[], cur, i) => {
    return ContractionPattern.test(cur) ? acc.concat(i) : acc;
  }, []);
};

findContrIndexes([
  "I'll",
  'make',
  'coffee',
  'and',
  "I've",
  'done',
  'my',
  'homework',
  '.',
]);
// returns [0, 4]

 

The regex /\b\w+'\w*\b/ is used to recognize a variety of contractions such as I've and can't. \w+' is the part up to and including the apostrophe ('), and \w* is the part after it.

  • \b : word boundary (a position where a word character has no other word character directly before or after it)
  • \w+' : one or more consecutive word characters followed by an apostrophe
  • \w* : zero or more consecutive word characters. Because the closing \b cannot match right after an apostrophe at the end of the token, at least one word character is needed after the apostrophe in practice
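A quick check of the pattern against a few tokens. The sample list is made up for illustration; "dogs'" and "'ll" are included to show two cases the pattern rejects.

```typescript
const ContractionPattern = /\b\w+'\w*\b/;

// Illustrative tokens: two contractions, then four tokens that should not match.
const samples = ["I've", "can't", 'make', '.', "dogs'", "'ll"];
const results = samples.map((token) => ContractionPattern.test(token));

console.log(results); // [true, true, false, false, false, false]
```

"dogs'" fails because nothing follows the apostrophe, so the closing \b has no boundary to match. "'ll" fails because \w+ needs at least one word character before the apostrophe.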

 

generateContrMap: returning contraction info for each word in a sentence


// Assumes the `uuid` npm package is installed; it provides uuidv4 used below
import { v4 as uuidv4 } from 'uuid';

export type ExpandedToken = {
  id: string;
  token: string;
};

export type Contraction = {
  originalToken: string; // original word
  isContr: boolean; // whether the word is a contraction
  expandedTokens: ExpandedToken[]; // tokens the contraction expands into
  autoExpand: boolean; // whether to expand the contraction automatically
};

const ContractionPattern = /\b\w+'\w*\b/;

const makeExpandedContrToken = (token = '') => ({
  id: uuidv4(),
  token,
});

const makeExpandedContrTokens = (count = 2, token = '') => {
  return Array.from({ length: count }, () => makeExpandedContrToken(token));
};

const generateContrMap = (tokens: string[]): Contraction[] => {
  return tokens.map((word) => {
    const isContr = ContractionPattern.test(word);
    return {
      originalToken: word,
      isContr,
      expandedTokens: isContr ? makeExpandedContrTokens() : [],
      autoExpand: false,
    } satisfies Contraction;
  });
};

generateContrMap([
  "I'll",
  'make',
  'coffee',
  'and',
  "I've",
  'done',
  'my',
  'homework',
  '.',
]);
// Value returned by generateContrMap (Contraction[])
[
  {
    originalToken: "I'll",
    isContr: true,
    expandedTokens: [{ id: '...', token: '' }, { id: '...', token: '' }],
    autoExpand: false,
  },
  {
    originalToken: 'make',
    isContr: false,
    expandedTokens: [], // always an empty array when isContr is false
    autoExpand: false,
  },
  // ...
];

 

  • When isContr is true, expandedTokens is assigned an array of placeholder tokens (two by default) that will hold the expanded words
  • When isContr is false, expandedTokens is always an empty array
  • The default value of expandedTokens[n].token is the empty string ''

 

์ถ•์•ฝ์„ ํŽผ์นœ ๋ฌธ์ž์—ด์„ ๋ฐ˜ํ™˜ํ•˜๋Š” getTokensWithExpandedContr


const getTokensWithExpandedContr = (contrMap: Contraction[]) => {
  return contrMap.reduce((result: string[], item: Contraction) => {
    if (item.isContr) {
      // Ignore placeholders that are empty or whitespace-only
      const expandedTokens = item.expandedTokens
        .map(({ token }) => token.trim())
        .filter((token) => token.length > 0);

      if (expandedTokens.length) return result.concat(expandedTokens);
    }

    return result.concat(item.originalToken);
  }, []);
};
// Before expanding contractions
["I'll", 'make', 'coffee', 'and', "I've", 'done', 'my', 'homework', '.']

// After expanding contractions (value returned by getTokensWithExpandedContr)
['I', 'will', 'make', 'coffee', 'and', 'I', 'have', 'done', 'my', 'homework', '.']
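The notes do not show how expandedTokens gets its values between generateContrMap and getTokensWithExpandedContr. The sketch below assumes they are filled in by hand (for example from user input). The ids are placeholders rather than real UUIDs, and the function is a condensed rewrite of the one above so the snippet runs on its own.

```typescript
type ExpandedToken = { id: string; token: string };
type Contraction = {
  originalToken: string;
  isContr: boolean;
  expandedTokens: ExpandedToken[];
  autoExpand: boolean;
};

// Same behavior as getTokensWithExpandedContr above, condensed.
const getTokensWithExpandedContr = (contrMap: Contraction[]): string[] =>
  contrMap.reduce((result: string[], item) => {
    const expanded = item.isContr
      ? item.expandedTokens.map(({ token }) => token.trim()).filter((t) => t.length > 0)
      : [];
    // Fall back to the original token when nothing usable was filled in.
    return result.concat(expanded.length > 0 ? expanded : [item.originalToken]);
  }, []);

// Hypothetical hand-filled map: "I'll" is expanded, "I've" is left blank.
const contrMap: Contraction[] = [
  {
    originalToken: "I'll",
    isContr: true,
    expandedTokens: [{ id: 'a', token: 'I' }, { id: 'b', token: ' will ' }],
    autoExpand: false,
  },
  { originalToken: 'make', isContr: false, expandedTokens: [], autoExpand: false },
  {
    originalToken: "I've",
    isContr: true,
    expandedTokens: [{ id: 'c', token: '' }, { id: 'd', token: '' }],
    autoExpand: false,
  },
];

const expandedResult = getTokensWithExpandedContr(contrMap);
console.log(expandedResult); // ['I', 'will', 'make', "I've"]
```

A contraction whose placeholders are all empty is kept as its original token, so a partially filled map still yields a complete sentence.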

 

expandContractions: expanding contracted words


const CONTRACTIONS: Record<string, string> = {
  "'ll": ' will',
  "'ve": ' have',
  "'re": ' are',
  "'d": ' would',
  "'m": ' am',
  "'s": ' is',
  "can't": 'cannot',
  "couldn't": 'could not',
  "shouldn't": 'should not',
  "won't": 'will not',
  "wouldn't": 'would not',
  "doesn't": 'does not',
  "don't": 'do not',
  "didn't": 'did not',
  "n't": ' not',
  "ain't": 'am not', // or 'is not', 'are not', 'has not', 'have not' based on context
  "aren't": 'are not',
  "wasn't": 'was not',
  "weren't": 'were not',
  "hasn't": 'has not',
  "haven't": 'have not',
  "isn't": 'is not',
  "it's": 'it is', // or 'it has' based on context
  "i'm": 'I am',
  "i've": 'I have',
  "i'd": 'I would', // or 'I had' based on context
  "i'll": 'I will',
  "you're": 'you are',
  "you've": 'you have',
  "you'd": 'you would', // or 'you had' based on context
  "you'll": 'you will',
  "let's": 'let us',
  "he's": 'he is', // or 'he has' based on context
  "she's": 'she is', // or 'she has' based on context
  "they're": 'they are',
  "they've": 'they have',
  "they'd": 'they would', // or 'they had' based on context
  "they'll": 'they will',
};
const ContractionsPattern = new RegExp(
  Object.keys(CONTRACTIONS).join('|'),
  'g',
);

/* Resulting ContractionsPattern
/'ll|'ve|'re|'d|'m|'s|can't|couldn't|shouldn't|won't|wouldn't|.../g
*/
const expandContractions = (sentence: string, separator = ' ') => {
  return sentence
    .replace(ContractionsPattern, (match) => CONTRACTIONS[match])
    .split(separator);
};

expandContractions("I'll make coffee and I've done my homework.");
// ['I', 'will', 'make', 'coffee', 'and', 'I', 'have', 'done', 'my', 'homework.']

 

  • The replacement function passed as the second argument to replace is called every time the pattern matches
  • In the example above it is called twice, with "'ll" and then "'ve" passed as the match parameter
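One caveat: the keys are lowercase and the pattern has no i flag, so a capitalized "Can't" is only caught by the "n't" entry and becomes "Ca not". The sketch below is one possible case-insensitive variant, not the original implementation. `expandIgnoringCase` and `CaseInsensitivePattern` are hypothetical names, and the table is a small subset of CONTRACTIONS repeated so the snippet runs on its own.

```typescript
// Subset of the CONTRACTIONS table above, repeated for a self-contained sketch.
const CONTRACTIONS: Record<string, string> = {
  "can't": 'cannot',
  "won't": 'will not',
  "n't": ' not',
  "'ll": ' will',
  "'ve": ' have',
};

// Same construction as ContractionsPattern, plus the 'i' flag to ignore case.
const CaseInsensitivePattern = new RegExp(Object.keys(CONTRACTIONS).join('|'), 'gi');

const expandIgnoringCase = (sentence: string): string =>
  sentence.replace(CaseInsensitivePattern, (match) => {
    // Keys are lowercase, so look the match up in lowercase.
    const expanded = CONTRACTIONS[match.toLowerCase()] ?? match;
    const first = match.charAt(0);
    // Re-apply a leading capital letter, e.g. "Can't" -> "Cannot".
    return first !== first.toLowerCase()
      ? expanded.charAt(0).toUpperCase() + expanded.slice(1)
      : expanded;
  });

const expandedSentence = expandIgnoringCase("Can't stop, won't stop. I'll wait.");
console.log(expandedSentence); // "Cannot stop, will not stop. I will wait."
```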

๋ฐ˜์‘ํ˜•