Making the Cursor Editor Smarter with Bright Data Web Scraping

What I Built

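Modern IDEs such as Cursor ship with an AI agent that helps us code faster and more easily. The agent has a limitation, though. Cursor's chat feature can search the web, and if you paste a URL into it, it scrapes that page and answers based on the content. Composer, the AI agent in Cursor, cannot do this. Below is the chat feature in Cursor; as you can see, it has web search: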

But if you go to Composer in the Cursor editor, the feature is missing, especially in agent mode:

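So my goal is to build a web scraper and a web searcher with Bright Data and Gemini's OpenAI-compatible model, making the Cursor editor smarter by giving it web search and web scraping.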

Demo

uratmangun / bright-scraper

Web Scraper

A powerful web scraping utility built with Playwright that extracts content and links from websites.

Prerequisites

Node.js 18+

pnpm (recommended) or bun

Installation and Usage

You can use this tool in two ways:

1. Using npx (recommended)

Run it directly with npx, no installation required:

npx @uratmangun/scraper-tool show-content 
# or
npx @uratmangun/scraper-tool search ""

2. Local installation

Install the dependencies:

pnpm install
# or
bun install

Set up the environment variables:

cp .env.example .env.local

Then edit `.env.local` and set `BRIGHT_PLAYWRIGHT_URL` for the Playwright CDP connection.

Commands

Show content

# Using npx
npx @uratmangun/scraper-tool show-content  

# Using local installation
pnpm run scrape show-content

Example:

npx @uratmangun/scraper-tool show-content html https://example.com
# or
npx @uratmangun/scraper-tool show-content text https://example.com

This displays the HTML or plain-text content of the specified URL…


How I Used Bright Data

To do this, I first wrote a script that uses the Bright Data web scraper API.

Bright Data: Web Data Platform


Let's create a script named `scrape.mjs` with the following content:

scrape.mjs (view raw on GitHub)
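The embedded script is not reproduced in this text, so what follows is only a minimal sketch of how a `show-content` scraper could work, not the actual `scrape.mjs`. It assumes `BRIGHT_PLAYWRIGHT_URL` is a CDP endpoint that Playwright can attach to. The function name `showContent` and the injected `connect` parameter are hypothetical; injecting the connection makes the logic easy to exercise without a live browser.

```javascript
// Sketch only: load a page through a remote browser and return HTML or text.
// `connect` is injected so the function can run against a fake browser;
// in real use it would wrap Playwright's chromium.connectOverCDP().
async function showContent(connect, endpoint, mode, url) {
  if (mode !== "html" && mode !== "text") {
    throw new Error(`mode must be "html" or "text", got "${mode}"`);
  }
  const browser = await connect(endpoint);
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // page.content() returns the full HTML; innerText("body") the rendered text.
    return mode === "html"
      ? await page.content()
      : await page.innerText("body");
  } finally {
    await browser.close(); // always release the remote session
  }
}

// Example wiring (needs the `playwright` package; not executed here):
//   const { chromium } = await import("playwright");
//   const html = await showContent(
//     (endpoint) => chromium.connectOverCDP(endpoint),
//     process.env.BRIGHT_PLAYWRIGHT_URL,
//     "html",
//     "https://example.com"
//   );
//   console.log(html);
```

Closing the browser in `finally` matters with a remote browser, because a session left open after an error may keep consuming resources on the provider's side.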

You can use it like this:

pnpm run scrape show-content

It prints the HTML content of the URL, like this:

...

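This is nice, but the output is still unstructured. Before turning it into structured data that both machines and humans can understand, we will also use Bright Data's web scraper to run a search. Create the following script as `search.mjs`: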

search.mjs (view raw on GitHub)
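The search script is not reproduced in this text either. As a rough sketch of the first step such a script needs, the helper below turns a query into a search URL that can then be loaded through the remote browser like any other page. The choice of Google as the engine is an assumption, and `buildSearchUrl` is a hypothetical name.

```javascript
// Sketch only: build the URL of a search results page for a query.
// The default engine (Google) is an assumption, not taken from the source.
function buildSearchUrl(query, engine = "https://www.google.com/search") {
  const trimmed = String(query).trim();
  if (!trimmed) {
    throw new Error("search query must not be empty");
  }
  const url = new URL(engine);
  // searchParams.set() handles the percent-encoding of the query.
  url.searchParams.set("q", trimmed);
  return url.toString();
}

console.log(buildSearchUrl("web scraping tutorials"));
// prints "https://www.google.com/search?q=web+scraping+tutorials"
```

Using `URL` and `searchParams` rather than string concatenation keeps quotes, ampersands, and non-ASCII characters in the query from breaking the URL.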

You can run it with `pnpm run search`, and it prints something like this:

...
AAAAAEAACgIQAAAAAACgAAAAAAAAAAAAAABIAAAAAAAAECAABEJCAAAEAAAAAMACAAAILAABAgAEAAAAAAAEAAgAIEAEYL__OAAAAAAAAAAAAAQCABEAAAAAAHABABAE0d4AAQAAAAgAAAAMAAAAQAAAAAAAAAUAAAAAAAAAAAQAAAAAAAAAAAAAAAABAPoBAAAAAAAAAAAAAAACAAAAAABggAIAAvgBAAAAAACAAwAAAAABAQAAOAIGIAAAAAAAAAD3AcDjAeGQwgIAAAAAAAAAAAAAAAABSBDMgfQXBCAAAAAAAAAAAAAAAAAAAJAiaOJyAwAC/d=0/dg=0/br=1/rs=ACT90oEbXjTDEsqDs2o3NzHTmzVZxjp5ng/m=sy27z,sy28k,sy27w,sy29c,sy28x,sy28v,sy288,sy280,M0O4le?xjs=s4" nonce="">

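Again, this is still unstructured, so now we need AI to parse it all. I use Gemini to turn this content into a format that is easier to parse, such as JSON. We start by extracting the links from the unstructured search data, so let's write a script that takes this unstructured data and uses AI to convert it to JSON.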

convert-search.mjs (view raw on GitHub)
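Since the conversion script is not reproduced in this text, here is a minimal sketch of the idea: send the raw HTML to Gemini through its OpenAI-compatible chat completions endpoint and ask for a JSON array back. The model name, the prompt wording, and the names `buildRequest`, `convertSearch`, and `fetchImpl` are all assumptions made for illustration; the real script may differ.

```javascript
// Sketch only: ask a Gemini model, through the OpenAI-compatible endpoint,
// to turn raw search HTML into a JSON array of { title, url, description }.
const GEMINI_CHAT_URL =
  "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions";

// The model name below is an assumption; use whichever model you have access to.
function buildRequest(html, model = "gemini-1.5-flash") {
  return {
    model,
    messages: [
      {
        role: "system",
        content:
          "Extract the search results from the HTML the user sends. " +
          "Reply with only a JSON array of objects with the keys " +
          "title, url and description.",
      },
      { role: "user", content: html },
    ],
  };
}

// `fetchImpl` defaults to the global fetch (Node.js 18+) and can be replaced
// with a stub for testing.
async function convertSearch(html, apiKey, fetchImpl = fetch) {
  const res = await fetchImpl(GEMINI_CHAT_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildRequest(html)),
  });
  if (!res.ok) {
    throw new Error(`Gemini request failed with status ${res.status}`);
  }
  const data = await res.json();
  // The assistant message has the shape { role: "assistant", content: "..." }.
  return data.choices[0].message;
}
```

In practice the HTML of a search page can be large, so trimming scripts and styles before sending it may be needed to stay within the model's input limit.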

Now when you run it, you get a result like this:

{
  content: '[\n' +
    '    {\n' +
    '        "description": "Order Panda Express | A Fast Casual Chinese Restaurant ...",\n' +
    '        "title": "Order Panda Express | A Fast Casual Chinese Restaurant ...",\n' +
    '        "url": "https://www.pandaexpress.com/"\n' +
    '    },\n' +
    '    {\n' +
    `        "description": "The only natural habitat for giant pandas in the world is located in southwestern China. Combined with the requirement that all cubs must return to China this creates the sense that pandas belong in and to China, and a country can only receive them if they have good relations with the People's Republic.",\n` +
    `        "title": "The Giant Pandas Have Left the National Zoo. What's Next for U.S. ...",\n` +
    `         "url": "https://www.georgetown.edu/news/the-giant-pandas-have-left-the-national-zoo-whats-next-for-u-s-china-relations/#:~:text=The%20only%20natural%20habitat%20for,relations%20with%20the%20People's%20Republic."\n` +
    '    },\n' +
    '    {\n' +
    '        "description": "Red pandas are the only living members of their taxonomic family, Ailuridae, while giant pandas are in the bear family, Ursidae.",\n' +
    '        "title": "Is a Red Panda a Bear? And More Red Panda Facts ...",\n' +
    '        "url": "https://nationalzoo.si.edu/animals/news/red-panda-bear-and-more-red-panda-facts"\n' +
    '    },\n' +
    '  {\n' +
    '        "description": "Giant pandas live in a few mountain ranges in south central China, in Sichuan, Shaanxi and Gansu provinces. They once lived in lowland areas, but farming, forest clearing and other development now restrict giant pandas to the mountains.",\n' +
    '        "title": "Giant panda",\n' +
    '        "url": "https://nationalzoo.si.edu/animals/giant-panda#:~:text=Giant%20pandas%20live%20in%20a,giant%20pandas%20to%20the%20mountains."\n' +
    '    },\n' +
    '     {\n' +
    `        "description": "Pandas have excellent camouflage for their habitat. The giant panda's distinct black-and-white markings have two functions: camouflage and communication. Most of the panda - its face, neck, belly, rump - is white to help it hide in snowy habitats. The arms and legs are black, helping it to hide in shade.",\n` +
    '        "title": "Top 10 facts about Pandas - WWF",\n' +
    '        "url": "https://www.wwf.org.uk/learn/fascinating-facts/pandas#:~:text=Pandas%20have%20excellent%20camouflage%20for,it%20to%20hide%20in%20shade."\n' +
    '    },\n' +
    '     {\n' +
    '        "description": "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China. It is characterised by its white coat with black patches around the eyes, ears, legs and shoulders. Its body is rotund; adult individuals weigh 100 to 115 kg and are typically 1.2 to 1.9 m long.",\n' +
    '        "title": "Giant panda",\n' +
    '        "url": "https://en.wikipedia.org/wiki/Giant_panda"\n' +
    '    },\n' +
    '   {\n' +
    `        "description": "The giant panda is the rarest member of the bear family and among the world's most threatened animals. Learn about WWF's giant panda conservation efforts.",\n` +
    '        "title": "Giant Panda | Species | WWF",\n' +
    '        "url": "https://www.worldwildlife.org/species/giant-panda"\n' +
    '    },\n' +
    '      {\n' +
    '        "description": "Panda Security antivirus: tailor-made computer security solutions. All our expertise to protect and simplify your life online.",\n' +
    '          "title": "Panda Security | Official Website",\n' +
    '        "url": "https://www.pandasecurity.com/"\n' +
    '    }\n' +
    ']',
  role: 'assistant'
}
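Note that `content` in this result is still a string that happens to contain JSON, so a consumer has to parse it before use. The helper below is a small sketch of that step; `parseResults` is a hypothetical name, and stripping a Markdown code fence is a defensive assumption, since models sometimes wrap JSON in one.

```javascript
// Sketch only: turn the assistant message into an array of result objects.
function parseResults(message) {
  // Remove an optional Markdown code fence around the JSON before parsing.
  const text = message.content
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");
  const parsed = JSON.parse(text);
  if (!Array.isArray(parsed)) {
    throw new Error("expected a JSON array of results");
  }
  // Keep only entries with a usable URL and normalize the other fields.
  return parsed
    .filter((r) => r && typeof r.url === "string")
    .map((r) => ({
      title: r.title ?? "",
      url: r.url,
      description: r.description ?? "",
    }));
}

const example = {
  role: "assistant",
  content: '[{"title": "Giant panda", "url": "https://en.wikipedia.org/wiki/Giant_panda"}]',
};
console.log(parseResults(example)[0].url);
// prints "https://en.wikipedia.org/wiki/Giant_panda"
```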

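That is the parsed HTML result. Now we need to put everything into one file so that we can run it globally and Composer can use it to get more information from a search engine. I created `scrape-or-search.mjs`, which merges `search.mjs` and `scrape.mjs`, as follows: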

scrape-or-search.mjs (view raw on GitHub)

We also create `bin/cli.js` so that the tool can later be run globally from the command line, for example `npx @uratmangun/scraper-tool search` or `npx @uratmangun/scraper-tool scrape `:

cli.js (view raw on GitHub)
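The CLI entry point is not reproduced in this text. As a sketch of the argument handling such a file needs, the function below maps command-line arguments to a command object. It covers the `search` and `show-content html|text` forms used in the README examples; `parseArgs` and the usage strings are hypothetical. A real `bin/cli.js` would also start with a `#!/usr/bin/env node` line and call something like `parseArgs(process.argv.slice(2))`.

```javascript
// Sketch only: map CLI arguments to a command object, or throw a usage error.
function parseArgs(argv) {
  const [command, ...rest] = argv;
  switch (command) {
    case "search": {
      // Accept both a single quoted query and several bare words.
      const query = rest.join(" ").trim();
      if (!query) throw new Error('usage: search "<query>"');
      return { command, query };
    }
    case "show-content": {
      const [mode, url] = rest;
      if ((mode !== "html" && mode !== "text") || !url) {
        throw new Error("usage: show-content <html|text> <url>");
      }
      return { command, mode, url };
    }
    default:
      throw new Error(`unknown command: ${command ?? "(none)"}`);
  }
}

console.log(parseArgs(["show-content", "text", "https://example.com"]));
```

Keeping the parsing in a pure function, separate from the code that actually scrapes, makes it possible to check the command handling without touching the network.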

Then, to publish this, we need to update our `package.json` accordingly:

package.json (view raw on GitHub)

To publish it to npm, run the following two commands one after the other:

npm login
npm publish --access public

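After that you can use it through `npx`. Next we need to set our global environment variables in `config.fish`, since I use the fish shell (for any other shell, you can ask ChatGPT). Open the file with `vi ~/.config/fish/config.fish` and add these two lines: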

set -gx BRIGHT_PLAYWRIGHT_URL 
set -gx GEMINI_API_KEY

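Run `source ~/.config/fish/config.fish` to apply the changes immediately. The tool is then available globally, which means that a command such as `npx @uratmangun/scraper-tool search "web scraping tutorials"` can also be run by the Cursor Composer agent to search the web.

To test that it works, let's create a project. We will use something called `.cursorrules`, which works somewhat like a system message for the AI: the agent reads `.cursorrules` before it does anything. Let's first try without `.cursorrules`. We will create an Ethereum project with scaffold-eth and frog. frog (https://frog.fm/) is a framework for building Farcaster frames. Farcaster frames are relatively new, so the model may not know what they are. I will also add https://docs.airstack.xyz/airstack-docs-and-faqs/farcaster/farcaster-frames/frames-validator. Let's create a new folder named `testing-composer`.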