
Problems with the split() method


In JavaScript, strings are usually split with the split() method. As the example below shows, however, the delimiters are dropped from the result array, the leading space of the second sentence is kept, and a trailing empty string is appended. In other words, split() is not language-sensitive: it matches a pattern mechanically and knows nothing about how a given language delimits sentences, words, or characters.

'Hello! How are you?'.split(/[.!?]/);
// ['Hello', ' How are you', '']

 

💡 Notes on the regular expression

  • Wrapping the pattern in capturing parentheses, which treat an expression as a single unit, also includes the delimiters in the result array
  • A [] character class matches any single character listed inside the brackets
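The two notes above can be seen side by side in a minimal sketch. The variable names are illustrative only and do not come from the original post.

```javascript
// Character class without a capture group: the delimiters are dropped.
const withoutDelimiters = 'Hello! How are you?'.split(/[.!?]/);
console.log(withoutDelimiters); // ['Hello', ' How are you', '']

// Same character class wrapped in capturing parentheses: the delimiters are kept.
const withDelimiters = 'Hello! How are you?'.split(/([.!?])/);
console.log(withDelimiters); // ['Hello', '!', ' How are you', '?', '']
```

Even with the capture group, the leading space and the trailing empty string remain, so the result still needs cleanup.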

 

The Intl JavaScript API solves these problems cleanly.

 

Intl.Segmenter


Intl.Segmenter splits a string into meaningful units. Create an instance with a locale and a granularity, then pass the string you want to split to its segment() method.

 

💡 granularity options

  • sentence: split by sentence
  • word: split by word
  • grapheme: split by grapheme cluster (a character as the user perceives it); this is the default

 

The segment() method returns a Segments instance that represents the segments of the string according to the configured locale and granularity. A Segments instance is iterable, so it works with the spread operator, Array.from, for...of loops, and so on.
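As a rough illustration of those iteration options, the sketch below walks the same Segments instance three ways. The sample string and variable names are assumptions made for this example. It also uses containing(), a Segments method not mentioned above, which looks up the segment that covers a given index.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
const segments = segmenter.segment('Hello world');

// 1. for...of: each item exposes segment, index, input (and isWordLike for 'word').
const viaForOf = [];
for (const { segment, index } of segments) {
  viaForOf.push(`${index}:${segment}`);
}
console.log(viaForOf); // ['0:Hello', '5: ', '6:world']

// 2. Array.from with a mapping function.
const viaArrayFrom = Array.from(segments, (s) => s.segment);

// 3. Spread into an array, then map.
const viaSpread = [...segments].map((s) => s.segment);

// containing(i) returns the segment data covering code unit index i.
const atIndex7 = segments.containing(7);
console.log(atIndex7.segment, atIndex7.index); // 'world' 6
```

Each iteration starts from the beginning, so the same Segments instance can be traversed more than once.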

 

With granularity: 'word' ▼

const segmenterKo = new Intl.Segmenter('ko', { granularity: 'word' });
const segmentsKo = segmenterKo.segment('안녕하세요, 반갑습니다');

// Words, punctuation, and whitespace each become their own segment
console.log([...segmentsKo]);
/*
{ segment: '안녕하세요', index: 0, input: '안녕하세요, 반갑습니다', isWordLike: true }
{ segment: ',', index: 5, input: '안녕하세요, 반갑습니다', isWordLike: false }
{ segment: ' ', index: 6, input: '안녕하세요, 반갑습니다', isWordLike: false }
{ segment: '반갑습니다', index: 7, input: '안녕하세요, 반갑습니다', isWordLike: true }
*/

 

With granularity: 'sentence' ▼

const segmenterKo = new Intl.Segmenter('ko', { granularity: 'sentence' });
const segmentsKo = segmenterKo.segment('안녕하세요, 반갑습니다');

// The input is a single sentence, so one segment comes back
console.log([...segmentsKo]);
/*
{ "segment": "안녕하세요, 반갑습니다", "index": 0, "input": "안녕하세요, 반갑습니다" }
*/
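To tie this back to the split() example at the top of the post, here is a small sketch that runs the same English string through a sentence segmenter. The variable names are illustrative, and the 'en' locale is an assumption for this input.

```javascript
const text = 'Hello! How are you?';

const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const sentences = Array.from(sentenceSegmenter.segment(text), (s) => s.segment);

// Each sentence keeps its punctuation; trailing whitespace stays with the
// sentence it follows, so joining the segments reproduces the input exactly.
console.log(sentences); // ['Hello! ', 'How are you?']

// Trim if only the visible sentence text is needed.
const trimmed = sentences.map((s) => s.trim());
console.log(trimmed); // ['Hello!', 'How are you?']
```

Unlike split(/[.!?]/), nothing is discarded and no empty string is appended.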

 

With granularity: 'grapheme' ▼

const segmenterKo = new Intl.Segmenter('ko', { granularity: 'grapheme' });
const segmentsKo = segmenterKo.segment('안녕하세요, 반갑습니다');

// Each user-perceived character becomes its own segment
console.log([...segmentsKo]);
/*
{ "segment": "안", "index": 0, "input": "안녕하세요, 반갑습니다" }
{ "segment": "녕", "index": 1, "input": "안녕하세요, 반갑습니다" }
{ "segment": "하", "index": 2, "input": "안녕하세요, 반갑습니다" }
...
*/

 

The isWordLike property


💡 The isWordLike property is only present when the string is split with word granularity

 

When a string is split by word, the segments include not only words but also whitespace, punctuation, and line breaks. To extract only the words, filter on the isWordLike property, a boolean that indicates whether the segment is word-like.

const segmenterKo = new Intl.Segmenter('ko', { granularity: 'word' });
const segmentsKo = segmenterKo.segment('안녕하세요, 반갑습니다');

// Keep only word-like segments; punctuation and whitespace are dropped
console.log([...segmentsKo].filter((s) => s.isWordLike));
/*
{ "segment": "안녕하세요", "index": 0, "input": "안녕하세요, 반갑습니다", "isWordLike": true }
{ "segment": "반갑습니다", "index": 7, "input": "안녕하세요, 반갑습니다", "isWordLike": true }
*/
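One common use of this filter is counting words. The sketch below is only an illustration: countWords is a hypothetical helper name, and the default 'en' locale is an assumption.

```javascript
// Hypothetical helper: counts the word-like segments in a string.
function countWords(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    if (isWordLike) count += 1; // skip whitespace and punctuation segments
  }
  return count;
}

console.log(countWords('Hello, world! How are you?')); // 5

// For the Korean sample used above, the two word-like segments in the
// earlier output correspond to a count of 2.
console.log(countWords('안녕하세요, 반갑습니다', 'ko'));
```

Counting inside the loop avoids building an intermediate array, which may matter for long inputs.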

 

Splitting emoji


Calling split('') on a string of emoji splits it into UTF-16 code units, so each emoji falls apart into lone surrogates. Spreading the string splits it by code point instead, which keeps simple emoji intact but still breaks a combined emoji such as 👨‍👨‍👦‍👦 into its component emoji and the zero-width joiners between them.

const emojis = '🫣🫵👨‍👨‍👦‍👦';
console.log(emojis.split('')); // Split by code units
// ['\uD83E', '\uDEE3', '\uD83E', '\uDEF5', '\uD83D', '\uDC68', '\u200D', ...]

console.log([...emojis]); // Split by code points
// ['🫣', '🫵', '👨', '\u200D', '👨', '\u200D', '👦', '\u200D', '👦']

 

To get the emoji as they appear visually rather than as code units or code points, use Intl.Segmenter with grapheme granularity.

const emojis = '🫣🫵👨‍👨‍👦‍👦';

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = segmenter.segment(emojis);

console.log(Array.from(segments, (s) => s.segment)); // ['🫣', '🫵', '👨‍👨‍👦‍👦']
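The same idea can back a character counter or a safe truncation routine. The following is a minimal sketch under that assumption: splitGraphemes and truncateGraphemes are hypothetical helper names, and the sample string is written with escapes so the zero-width joiners (\u200D) are visible.

```javascript
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: returns the user-perceived characters of a string.
function splitGraphemes(text) {
  return Array.from(graphemeSegmenter.segment(text), (s) => s.segment);
}

// Hypothetical helper: keeps at most `max` user-perceived characters,
// never cutting through a surrogate pair or a joined emoji sequence.
function truncateGraphemes(text, max) {
  return splitGraphemes(text).slice(0, max).join('');
}

// Two single emoji followed by a four-person family joined with ZWJ.
const sample = '\u{1FAE3}\u{1FAF5}\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';

console.log(sample.length);                 // 15 (UTF-16 code units)
console.log([...sample].length);            // 9  (code points)
console.log(splitGraphemes(sample).length); // 3  (grapheme clusters)

const firstTwo = truncateGraphemes(sample, 2);
console.log(firstTwo === '\u{1FAE3}\u{1FAF5}'); // true
```

By contrast, sample.slice(0, 5) would end on a lone high surrogate, which is the kind of breakage grapheme-based truncation avoids.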

 

💡 You can try the API and inspect the results directly on the Intl Explorer site (Playground)

 

References


 

How to split JavaScript strings into sentences, words or graphemes with "Intl.Segmenter"

`Intl.Segmenter` enables you to split strings into meaningful parts such as words, sentences and graphemes.

www.stefanjudis.com

 

